The HUD SDK provides the Task class for defining agent objectives and dataset utilities for managing task collections.

Task Class

from hud.datasets import Task
Pydantic model that defines an agent’s objective, setup, and evaluation criteria. Fields:
| Field | Type | Description | Default |
|-------|------|-------------|---------|
| `id` | `str \| None` | Unique identifier (UUID recommended) | `None` |
| `prompt` | `str` | Task instruction for the agent | Required |
| `mcp_config` | `dict[str, Any]` | MCP server configuration | Required |
| `setup_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to prepare environment | `None` |
| `evaluate_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to score performance | `None` |
| `agent_config` | `dict[str, Any] \| None` | Agent configuration (system_prompt, allowed_tools, etc.) | `None` |
| `metadata` | `dict[str, Any]` | Extra task metadata | `{}` |
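A minimal Task needs only the two required fields, prompt and mcp_config; the Docker image name below is a placeholder:
task = Task(
    prompt="Open the settings page",
    mcp_config={
        "local": {
            "command": "docker",
            "args": ["run", "--rm", "-i", "my-env-image"]  # placeholder image name
        }
    }
)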

Environment Variable Substitution

The mcp_config field automatically resolves environment variables using ${VAR_NAME} syntax:
task = Task(
    prompt="Navigate to the dashboard",
    mcp_config={
        "browser": {
            "url": "${HUD_MCP_URL:https://mcp.hud.so/v3/mcp}",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    }
)
Substitution happens automatically when a Task is created from a dict: environment variables are resolved using Python's Template.substitute() with a defaultdict that returns empty strings for missing variables.
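To make the resolution rules concrete, here is a rough stand-alone illustration of the behavior described above (not the SDK's actual implementation): ${VAR} becomes the environment value or an empty string, and ${VAR:default} falls back to the default.
import os
import re
from collections import defaultdict

def resolve_placeholders(value: str) -> str:
    # Illustration only: mirrors the documented behavior, not HUD's internals.
    env = defaultdict(str, os.environ)

    def repl(match: re.Match) -> str:
        name, _, default = match.group(1).partition(":")
        return env[name] or default

    return re.sub(r"\$\{([^}]+)\}", repl, value)

resolve_placeholders("Bearer ${HUD_API_KEY}")                     # "Bearer <key>" or "Bearer "
resolve_placeholders("${HUD_MCP_URL:https://mcp.hud.so/v3/mcp}")  # env value or the default URL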

Field Validators

Task automatically:
  1. Parses JSON strings - mcp_config and metadata can be JSON strings
  2. Converts dicts to MCPToolCall - setup_tool and evaluate_tool dicts are converted
  3. Resolves environment variables - Only when created from dict (preserves templates in model_dump())
# All of these work:
task = Task(
    prompt="Test",
    mcp_config='{"server": {"url": "..."}}',  # JSON string → dict
    setup_tool='{"name": "setup"}'            # JSON string → MCPToolCall
)

task = Task(
    prompt="Test", 
    mcp_config={...},
    setup_tool={"name": "setup", "arguments": {...}}  # dict → MCPToolCall
)

task = Task(
    prompt="Test",
    mcp_config={...},
    evaluate_tool=[  # List of tool calls
        {"name": "check_text", "arguments": {"text": "Success"}},
        {"name": "check_url", "arguments": {"url": "example.com"}}
    ]
)
When developing and testing agents, follow this progression: debug with a single task first, then scale up to full dataset evaluation.

Step 1: Single Task Development

Start with individual tasks to debug your agent and environment setup:
# Test one task at a time for full debugging visibility
from hud.agents import ClaudeAgent
from hud.datasets import Task

task = Task(
    prompt="Click the submit button",
    mcp_config={...}
)

agent = ClaudeAgent()
result = await agent.run(task, max_steps=50)
print(f"Result: {result.reward}")
Benefits:
  • Full error stack traces
  • Clear log output
  • Quick iteration cycle
  • Easy to debug environment issues

Step 2: Full Dataset Evaluation

Once single tasks work reliably, scale up to full dataset evaluation:
from hud.datasets import run_dataset

# Start with moderate concurrency
results = await run_dataset(
    name="SheetBench Evaluation",
    dataset="hud-evals/SheetBench-50",
    agent_class=ClaudeAgent,
    max_concurrent=50,  # Good starting point
    max_steps=50,
    auto_respond=True
)

# Scale up for larger datasets (adjust based on API rate limits)
results = await run_dataset(
    name="Large Dataset Evaluation",
    dataset="hud-evals/OSWorld-Verified-Gold",
    agent_class=ClaudeAgent,
    max_concurrent=100,  # Higher concurrency for speed
    max_steps=50,
    auto_respond=True
)
Benefits:
  • See all agent logs in CLI in real-time
  • Catch intermittent environment issues
  • Validate agent behavior at scale
  • Efficient asyncio-based concurrency
Concurrency Guidelines:
  • Start with max_concurrent=50 and adjust based on results
  • Increase to 100-200 for faster evaluation (if API limits allow)
  • Decrease to 10-20 if hitting rate limits (see the batching sketch below)
  • Most datasets work well with 50-100 concurrent tasks
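If you need to stay well under a strict rate limit, one option is to run the dataset in sequential batches at lower concurrency. This is a sketch using only run_dataset, with task_dicts assumed to be your list of task dictionaries:
from hud.datasets import run_dataset

all_results = []
batch_size = 100
for start in range(0, len(task_dicts), batch_size):
    batch = task_dicts[start:start + batch_size]  # list[dict] is accepted by run_dataset
    results = await run_dataset(
        name=f"Evaluation batch {start // batch_size}",
        dataset=batch,
        agent_class=ClaudeAgent,
        max_concurrent=20,  # lower concurrency to respect rate limits
        max_steps=50,
    )
    all_results.extend(results)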

Quick Reference

| Stage | Method | Concurrency | Use Case | Debugging |
|-------|--------|-------------|----------|-----------|
| Development | Single task | 1 | Initial debugging | Excellent |
| Production | run_dataset | 50-200 | Full evaluation | Good |

Dataset Functions

run_dataset

from hud.datasets import run_dataset

# Run full dataset with asyncio-based concurrency
results = await run_dataset(
    name="SheetBench Evaluation",
    dataset="hud-evals/SheetBench-50",  # or Dataset object or list[dict]
    agent_class=ClaudeAgent,
    agent_config={"model": "claude-sonnet-4-20250514", "validate_api_key": False},
    max_concurrent=50,  # Adjust based on API rate limits (1-200)
    max_steps=50,
    auto_respond=True  # Automatically handle response continuations
)
Runs all tasks in a dataset with efficient asyncio-based concurrency.
Parameters:
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `name` | `str` | Job name for tracking | Required |
| `dataset` | `str \| Dataset \| list[dict]` | HF dataset ID, Dataset object, or task dicts | Required |
| `agent_class` | `type[MCPAgent]` | Agent class to instantiate | Required |
| `agent_config` | `dict[str, Any] \| None` | Constructor kwargs for agent | `None` |
| `max_concurrent` | `int` | Maximum concurrent tasks (1-200 recommended) | `30` |
| `metadata` | `dict[str, Any] \| None` | Job metadata | `None` |
| `max_steps` | `int` | Max steps per task | `10` |
| `split` | `str` | Dataset split when loading by ID | `"train"` |
| `auto_respond` | `bool` | Use ResponseAgent for continuations | `False` |
Returns: list[Any] - Results in dataset order.
Features:
  • Automatic job tracking with hud.async_job()
  • Per-task trace tracking with hud.async_trace()
  • Efficient asyncio-based concurrency with semaphores
  • Automatic telemetry flushing
  • Progress tracking with individual traces
  • Graceful error handling (continues on failures)
Example:
# From HuggingFace dataset ID
results = await run_dataset(
    "SheetBench Evaluation",
    "hud-evals/SheetBench-50",
    ClaudeAgent,
    {"model": "claude-3-5-sonnet-20241022"}
)

# From Dataset object
from datasets import load_dataset
dataset = load_dataset("hud-evals/OSWorld-Verified", split="train")
results = await run_dataset("OSWorld Test", dataset, OperatorAgent)

# From list of dicts
tasks = [
    {
        "prompt": "Click submit",
        "mcp_config": {...},
        "setup_tool": {"name": "setup"}
    }
]
results = await run_dataset("Custom", tasks, ClaudeAgent)

fetch_system_prompt_from_dataset

from hud.datasets import fetch_system_prompt_from_dataset

system_prompt = await fetch_system_prompt_from_dataset("hud-evals/SheetBench-50")
Fetch system_prompt.txt from a HuggingFace dataset repository.
Returns: str | None - System prompt text if found.
Note: Requires huggingface_hub to be installed.
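One way to use the fetched prompt is to pass it to the agent through run_dataset's agent_config (constructor kwargs). This sketch assumes your agent class accepts a system_prompt keyword argument; check your agent's constructor:
from hud.agents import ClaudeAgent
from hud.datasets import fetch_system_prompt_from_dataset, run_dataset

system_prompt = await fetch_system_prompt_from_dataset("hud-evals/SheetBench-50")

results = await run_dataset(
    name="SheetBench with dataset prompt",
    dataset="hud-evals/SheetBench-50",
    agent_class=ClaudeAgent,
    agent_config={"system_prompt": system_prompt} if system_prompt else None,  # assumes the agent accepts system_prompt
    max_concurrent=50,
)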

save_tasks

from hud.datasets import save_tasks

# IMPORTANT: Pass dicts, not Task objects!
task_dicts = [task.model_dump() for task in tasks]
save_tasks(
    tasks=task_dicts,
    repo_id="my-org/my-tasks",
    private=False,
    dataset_name="My Tasks v1.0"
)
Save tasks to a HuggingFace dataset with JSON string serialization.
Parameters:
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `tasks` | `list[dict[str, Any]]` | Task dictionaries (NOT Task objects) | Required |
| `repo_id` | `str` | HuggingFace repository ID | Required |
| `**kwargs` | `Any` | Additional args for `push_to_hub()` | - |
Important: Always pass dictionaries to preserve environment variable templates:
# ✅ GOOD - Preserves ${HUD_API_KEY} in the dataset
task_dicts = [task.model_dump() for task in tasks]
save_tasks(task_dicts, "my-dataset")

# ❌ BAD - Would expose resolved API keys!
# save_tasks(tasks, "my-dataset")  # Don't do this!
Dataset Schema: The function converts complex fields to JSON strings to keep the HuggingFace schema clean (see the round-trip loading sketch after this list):
  • mcp_config → JSON string
  • setup_tool → JSON string (if present)
  • evaluate_tool → JSON string (if present)
  • metadata → JSON string (if present)
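Rows saved this way can be loaded with datasets and passed straight to Task, whose validators parse the JSON strings back into dicts and MCPToolCall objects. The repo name here is taken from the example above:
from datasets import load_dataset
from hud.datasets import Task

dataset = load_dataset("my-org/my-tasks", split="train")
tasks = [Task(**row) for row in dataset]  # JSON-string fields are parsed by Task's validators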

MCPToolCall Type

from hud.types import MCPToolCall
Defines a tool invocation with arguments. Fields:
| Field | Type | Description | Default |
|-------|------|-------------|---------|
| `name` | `str` | Tool name to call | Required |
| `arguments` | `dict[str, Any]` | Tool arguments | `{}` |
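As the examples on this page show, HUD environments commonly nest a specific function call inside the lifecycle tool's arguments:
from hud.types import MCPToolCall

# The outer call names the lifecycle tool ("setup"); the inner arguments select
# the concrete function and its parameters. Plain dicts of the same shape are
# coerced to MCPToolCall by Task's validators.
setup = MCPToolCall(
    name="setup",
    arguments={"name": "navigate", "arguments": {"url": "example.com"}},
)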

Real-World Examples

Loading Tasks from Datasets

From examples/run_evaluation.py:
# Load dataset
dataset = load_dataset("hud-evals/SheetBench-50", split="train")

# Get system prompt
dataset_system_prompt = await fetch_system_prompt_from_dataset("hud-evals/SheetBench-50")

# Run single task
sample_task = dataset[0]
task = Task(**sample_task)  # Auto-converts dict to Task

# Or run all tasks
results = await run_dataset(
    "Full SheetBench Evaluation",
    dataset,
    ClaudeAgent,
    {"model": "claude-3-5-sonnet-20241022"},
    max_concurrent=50
)

Task Structure in Datasets

From environments/text_2048/2048_taskconfigs.json:
{
    "id": "2048_target_256",
    "prompt": "You are playing 2048 and your goal is to reach the 256 tile.",
    "mcp_config": {
        "local": {
            "command": "docker",
            "args": ["run", "--rm", "-i", "hud-text-2048"]
        }
    },
    "setup_tool": {
        "name": "setup",
        "arguments": {
            "name": "board",
            "arguments": {"board_size": 4}
        }
    },
    "evaluate_tool": {
        "name": "evaluate", 
        "arguments": {
            "name": "max_number",
            "arguments": {"target": 256}
        }
    },
    "metadata": {
        "difficulty": 256,
        "game": "2048"
    }
}
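A config like this can be loaded and run as a single task. A minimal sketch, assuming the file holds a list of task dicts:
import json

from hud.agents import ClaudeAgent
from hud.datasets import Task

with open("environments/text_2048/2048_taskconfigs.json") as f:
    configs = json.load(f)  # assumed to be a list of task dicts

task = Task(**configs[0])
agent = ClaudeAgent()
result = await agent.run(task, max_steps=50)
print(f"Reward: {result.reward}")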

Creating and Saving Tasks

import os
from uuid import uuid4

# Set environment variables
os.environ["HUD_API_KEY"] = "sk-hud-..."
os.environ["BROWSER_PROVIDER"] = "anchorbrowser"

# Create tasks with env var templates
tasks = []
for url in ["example.com", "test.com"]:
    task = Task(
        id=str(uuid4()),
        prompt=f"Navigate to {url} and find the login button",
        mcp_config={
            "browser": {
                "url": "${HUD_MCP_URL:https://mcp.hud.so/v3/mcp}",
                "headers": {
                    "Authorization": "Bearer ${HUD_API_KEY}",
                    "X-Provider": "${BROWSER_PROVIDER}"
                }
            }
        },
        setup_tool=MCPToolCall(
            name="setup",
            arguments={"name": "navigate", "arguments": {"url": url}}
        ),
        evaluate_tool=MCPToolCall(
            name="evaluate", 
            arguments={"name": "element_exists", "arguments": {"selector": "button.login"}}
        ),
        metadata={
            "category": "navigation",
            "difficulty": "easy"
        }
    )
    tasks.append(task)

# Save to HuggingFace (preserves env var templates)
task_dicts = [t.model_dump() for t in tasks]
save_tasks(task_dicts, "my-org/navigation-tasks")

Agent Integration

Tasks automatically configure agents:
task = Task(
    prompt="Complete the form",
    mcp_config={...},
    setup_tool={...},
    evaluate_tool={...},
    agent_config={
        "system_prompt": "Be very careful with form fields",
        "allowed_tools": ["computer", "search"],
        "append_setup_output": True
    }
)

agent = ClaudeAgent()

# When agent.run(task) is called:
# 1. If no mcp_client, creates one from task.mcp_config
# 2. Applies task.agent_config to agent (system_prompt, allowed_tools, etc.)
# 3. Adds lifecycle tools to agent.lifecycle_tools
# 4. Runs setup_tool (hidden from agent)
# 5. Executes task.prompt
# 6. Runs evaluate_tool (hidden from agent)
# 7. Returns Trace with evaluation reward

Agent Config Options

The agent_config field supports the following options:
| Option | Type | Description |
|--------|------|-------------|
| `system_prompt` | `str` | Custom system prompt appended to agent's default |
| `allowed_tools` | `list[str]` | Tools the agent can use (replaces agent_tools) |
| `disallowed_tools` | `list[str]` | Tools to exclude from the agent |
| `append_setup_output` | `bool` | Include setup output in first message (default: True) |
| `initial_screenshot` | `bool` | Take screenshot before first action (default: True) |

Best Practices

  1. Use UUIDs for task IDs - Required for HuggingFace datasets
  2. Save dictionaries, not objects - Preserves env var templates
  3. Use agent_config for agent settings - Centralize agent configuration in one place
  4. Use metadata for filtering - Category, difficulty, tags
  5. Test locally first - Before uploading to HuggingFace
  6. Version your datasets - Use meaningful repo names

Common Patterns

Filtering Tasks

# Load all tasks
dataset = load_dataset("hud-evals/WebArena-Lite", split="train")

# Filter in Python
login_tasks = [
    task for task in dataset
    if "login" in task["prompt"].lower()
]

# Convert and run
results = await run_dataset(
    "Login Tasks Only",
    login_tasks,
    ClaudeAgent
)

Custom System Prompts

# Override dataset system prompt
results = await run_dataset(
    "Custom Evaluation",
    "hud-evals/SheetBench-50",
    ClaudeAgent,
    custom_system_prompt="You are an expert spreadsheet user. Be precise."
)

Environment Variable Management

# Tasks preserve templates
task = Task(
    prompt="Test",
    mcp_config={
        "server": {
            "api_key": "${API_KEY:default-key}"
        }
    }
)

# model_dump() preserves the template
data = task.model_dump()
assert data["mcp_config"]["server"]["api_key"] == "${API_KEY:default-key}"

# But task object has resolved value
os.environ["API_KEY"] = "real-key"
task2 = Task(**data)
# task2.mcp_config["server"]["api_key"] is now "real-key"
