Tasks define what agents should do and how to measure success. They combine a prompt, environment configuration, and optional setup/evaluation phases.
Lifecycle
Every task moves through up to three phases: optional setup, agent execution against the prompt, and optional evaluation that produces a reward.
Task Structure
from hud.datasets import Task
import uuid

task = Task(
    # Required fields
    prompt="Navigate to the login page and sign in as testuser@example.com",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },
    # Optional fields
    id=str(uuid.uuid4()),  # Required for HuggingFace datasets
    system_prompt="You are an expert web automation agent. Always verify page loads before interacting with elements.",
    setup_tool={
        "name": "playwright",
        "arguments": {
            "action": "navigate",
            "url": "https://example.com"
        }
    },
    evaluate_tool={
        "name": "evaluate",
        "arguments": {
            "name": "url_contains",
            "substring": "/dashboard"
        }
    },
    metadata={"category": "authentication", "difficulty": "easy"}
)
Field Reference
- prompt (required): The instruction given to the agent. Be specific and clear about success criteria.
- mcp_config (required): Environment connection configuration. Supports environment variables with ${VAR} syntax.
- id: Unique identifier. Required when creating HuggingFace datasets. Use str(uuid.uuid4()).
- system_prompt: Custom system prompt for the agent. Overrides the agent's default system prompt.
- setup_tool: Tool(s) to initialize environment state, each with a name and arguments.
- evaluate_tool: Tool(s) to score performance, each with a name and arguments.
- metadata: Additional task information for analysis and filtering.
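Because setup_tool and evaluate_tool are described as "Tool(s)", multi-step setup should be expressible as well. A minimal sketch, assuming setup_tool also accepts a list of tool calls in the same {name, arguments} shape (the second step and its selector argument are hypothetical):

setup_tool=[
    {"name": "playwright", "arguments": {"action": "navigate", "url": "https://example.com"}},
    {"name": "playwright", "arguments": {"action": "click", "selector": "#accept-cookies"}},  # hypothetical second step
]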
Environment Variables
mcp_config values can contain ${VAR} placeholders. They stay as literal templates in stored datasets and are resolved from the environment only when a Task object is constructed:
# In dataset JSON:
{
    "prompt": "Complete the TODO list",
    "mcp_config": {
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "${BROWSER_IMAGE}"
            }
        }
    }
}

# When loaded:
task = Task(**task_dict)  # Variables resolved here!
# Now task.mcp_config["hud"]["headers"]["Authorization"] = "Bearer sk-hud-..."
This enables:
- Public datasets without exposing secrets
- Environment-specific configurations
- CI/CD pipelines with different credentials
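In practice the credentials live outside the dataset. A minimal sketch of loading such a JSON file in a CI job (the tasks.json file name and the setdefault value are illustrative; HUD_API_KEY would come from CI secrets):

import json
import os

from hud.datasets import Task

# Assumed to be provided by the environment (e.g. CI secrets); the value here is illustrative
os.environ.setdefault("BROWSER_IMAGE", "hudpython/hud-browser:latest")

with open("tasks.json") as f:  # hypothetical file containing dicts like the one above
    task_dicts = json.load(f)

tasks = [Task(**d) for d in task_dicts]  # ${VAR} placeholders resolve here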
Running Tasks
# Agent automatically handles all phases
result = await agent.run(task)
print(f"Success: {result.reward}") # 0.0 to 1.0
The agent will:
- Execute setup_tool if provided
- Work on the prompt using available tools
- Execute evaluate_tool to calculate reward
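End to end, that looks like the following. This is a minimal sketch, assuming ClaudeAgent is importable from hud.agents, the relevant API keys (e.g. ANTHROPIC_API_KEY) are set, and task is the object built above:

import asyncio

from hud.agents import ClaudeAgent

async def main() -> None:
    agent = ClaudeAgent()                 # assumes ANTHROPIC_API_KEY is set in the environment
    result = await agent.run(task)        # runs setup -> prompt -> evaluate
    print(f"Success: {result.reward}")    # 0.0 to 1.0

asyncio.run(main())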
Working with Datasets
Tasks integrate with HuggingFace datasets:
from datasets import load_dataset
from hud.agents import ClaudeAgent
from hud.datasets import run_dataset

# Load dataset from HuggingFace
dataset = load_dataset("hud-evals/SheetBench-50", split="train")

# Run agent on entire dataset with automatic parallelization
results = await run_dataset(
    "SheetBench Run",
    dataset,  # Pass the dataset object or name (e.g. "hud-evals/SheetBench-50")
    agent_class=ClaudeAgent
)
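Summarizing a benchmark run is then straightforward. A small sketch, assuming each item in results exposes the same reward attribute shown for a single run:

rewards = [r.reward for r in results]
print(f"Mean reward over {len(rewards)} tasks: {sum(rewards) / len(rewards):.2f}")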
Creating Datasets
from hud.datasets import save_tasks

# Create task dictionaries (NOT Task objects!)
task_dicts = [
    {
        "prompt": "Navigate to the login page",
        "mcp_config": {
            "hud": {
                "url": "${MCP_URL}",
                "headers": {"Authorization": "Bearer ${HUD_API_KEY}"}
            }
        },
        "setup_tool": {"name": "playwright", "arguments": {"action": "navigate", "url": "https://example.com"}},
        "evaluate_tool": {"name": "url_match", "arguments": {"pattern": ".*/login"}}
    },
    # More task dicts...
]

# Save to HuggingFace (preserves ${VAR} templates)
save_tasks(task_dicts, "my-org/my-benchmark")
Always save dictionaries, not Task objects. Task objects have already resolved environment variables!
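To consume the published benchmark, load it back and build Task objects at run time, which is when the ${VAR} templates get filled in from the local environment. A short sketch using the repository name from the save call above (assuming the default train split):

from datasets import load_dataset
from hud.datasets import Task

dataset = load_dataset("my-org/my-benchmark", split="train")
tasks = [Task(**row) for row in dataset]  # ${VAR} resolved now, from the local environment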
Best Practices
- Use UUIDs: Always include id=str(uuid.uuid4()) for dataset tasks
- Clear Prompts: Be specific about success criteria
- Template Variables: Use ${VAR} syntax for shareable configs
- Rich Tools: Include both name and arguments in tool definitions