# Tasks

Design, build, and share evaluation scenarios for browser-based agents.

## Creating Tasks & TaskSets

Tasks define what browser-based agents should accomplish and how success is measured. TaskSets group these tasks for benchmarking and sharing.
## Core Task Workflow

1. **Define Task**: Specify the prompt, setup, and evaluation criteria for a specific environment.
2. **Test Locally**: Use `gym.make(task)` and `env.run(agent)` to iterate on your task.
3. **(Optional) Group into TaskSet**: Collect related tasks for benchmarking or organized evaluation.
4. **(Optional) Upload TaskSet**: Share your TaskSet on the HUD platform.
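In code, the local-testing step can be sketched as follows. Only `gym.make(task)` and `env.run(agent)` come from this guide; the import path, the `async` calling convention, and the placeholder task/agent are assumptions — check the API Reference for exact signatures.

```python
import asyncio

from hud import gym  # assumed import path


async def main() -> None:
    task = ...    # a Task for the hud-browser environment (see Task Structure)
    agent = ...   # your agent instance (hypothetical placeholder)

    # Create the environment and let the agent attempt the task.
    env = await gym.make(task)     # async usage is an assumption
    result = await env.run(agent)  # run the agent against the task
    print(result)


asyncio.run(main())
```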
## Task Structure

While tasks can be designed for various environments, this guide focuses on tasks for the `hud-browser` environment.

For creating tasks that operate in specialized Docker environments (e.g., desktop applications, custom web apps), please see the Environment Creation & Contribution Guide.
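A minimal `hud-browser` task might look like the sketch below. The field names (`prompt`, `gym`, `setup`, `evaluate`) and the `("function_name", *args)` tuple format are assumptions based on the sections that follow; consult the API Reference for the authoritative `Task` signature.

```python
from hud.task import Task  # assumed import path

# Sketch of a browser task: what the agent should do, how the
# environment starts, and how success is verified. All field names
# here are assumptions for illustration.
task = Task(
    prompt="Find the price of the Pro plan on the pricing page.",
    gym="hud-browser",
    setup=("goto", "https://example.com"),   # initial browser state
    evaluate=("response_includes", "Pro"),   # success check on the agent's answer
)
```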
### Setup Functions (for `hud-browser`)

| Function | Description |
|---|---|
| `goto(url)` | Navigates to a URL. |
| `load_html_content(html)` | Loads static HTML content into the browser. |
| `sheets_from_xlsx(url)` | Downloads an XLSX file, converts it to a Google Sheet, and navigates to it. |
Actions such as clicking elements, typing text, performing complex page manipulations, or waiting for dynamic content after the initial page load are typically carried out by the agent through `env.step()` calls during task execution, not as part of the initial `Task.setup`.
For a detailed list of all setup functions available in the `hud-browser` environment and their usage examples, please see the Browser Environment Setup Functions Documentation.
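For example, a task can start from controlled static content rather than a live site. As above, the `setup` and `evaluate` formats are assumptions; see the Browser Environment Setup Functions Documentation for the exact schema.

```python
from hud.task import Task  # assumed import path

# Start the browser on deterministic, locally defined HTML so the
# agent always sees the same initial state.
task = Task(
    prompt="Submit the newsletter signup form with any email address.",
    gym="hud-browser",
    setup=("load_html_content",
           "<form id='signup'><input name='email'><button>Join</button></form>"),
    # Assumed check: the agent's action history touched the submit button.
    evaluate=("selector_history", "#signup button"),
)
```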
### Evaluate Functions (Verifying Task Success)

Evaluate functions are called by `env.evaluate()` after the agent has completed its interactions (or reached a step limit) to determine whether the task objectives were met.

For `hud-browser` tasks, evaluation functions commonly check page content, the URL, browser state, or the agent's actions and final response:
| Category | Common Functions |
|---|---|
| Content | `page_contains`, `element_exists`, `text_matches` |
| URL/Navigation | `url_contains`, `url_match` |
| Browser State | `cookie_exists` |
| Agent Response | `response_includes` |
| Action History | `selector_history`, `verify_type_action`, `history_length`, `raw_last_action_is` |
| Spreadsheets | `sheets_cell_values` |
For a detailed list of all evaluation functions available in the `hud-browser` environment, their parameters, and usage examples, please see the Browser Environment Evaluate Functions Documentation.
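Multiple checks can be combined for more robust verification. The list-of-tuples format (and how multiple checks are aggregated) is an assumption consistent with the sketches above; see the Evaluate Functions Documentation for the real schema.

```python
from hud.task import Task  # assumed import path

# Combine checks from several categories (aggregation semantics are
# an assumption; all field names are illustrative).
task = Task(
    prompt="Log in and open the account settings page.",
    gym="hud-browser",
    setup=("goto", "https://example.com/login"),
    evaluate=[
        ("url_contains", "settings"),           # URL/Navigation
        ("page_contains", "Account Settings"),  # Content
        ("cookie_exists", "session_id"),        # Browser State
    ],
)
```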
## TaskSet Creation & Management

TaskSets are collections of related `Task` objects, useful for running benchmarks, organizing evaluations, or sharing common scenarios.
### Creating a TaskSet
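A sketch of building a TaskSet from related tasks. The `TaskSet` import path and constructor fields are assumptions — see the API Reference for the real signature.

```python
from hud.task import Task        # assumed import paths
from hud.taskset import TaskSet

tasks = [
    Task(
        prompt="Find the return policy page.",
        gym="hud-browser",
        setup=("goto", "https://example.com"),
        evaluate=("url_contains", "returns"),
    ),
    Task(
        prompt="Add any product to the cart.",
        gym="hud-browser",
        setup=("goto", "https://example.com/shop"),
        evaluate=("element_exists", ".cart-badge"),
    ),
]

# Constructor fields (name, tasks) are assumptions for illustration.
taskset = TaskSet(name="shopping-basics", tasks=tasks)
```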
### Uploading & Publishing TaskSets

Once created, you can upload your TaskSet to the HUD platform to make it available to yourself, your team, or the public.
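An upload might look like the sketch below; the `upload` method name, its arguments, and the async call are all assumptions — the actual SDK call may differ, so check the API Reference.

```python
# Sketch only: 'taskset.upload' is an assumed method name for
# publishing a TaskSet to the HUD platform.
await taskset.upload(name="shopping-basics")
```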
#### Publishing and Sharing

Once uploaded, TaskSets can be:

- **Private**: Visible only to you (the default).
- **Public**: Optionally published to the wider HUD community.
- **Shared with Team**: (Coming Soon) Shared within your HUD organization.

Uploaded TaskSets are managed at app.hud.so/evalsets.
## Pre-built TaskSets

Load and run existing benchmarks:
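Loading a published benchmark might look like this sketch; `load_taskset` is an assumed helper name, and the async call is an assumption.

```python
import hud  # the HUD SDK package

# Fetch a pre-built TaskSet by name (sketch; helper name assumed).
taskset = await hud.load_taskset("WebVoyager")
```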
Available TaskSets on HUD:

- **WebVoyager**: Web navigation and complex interaction.
- **Mind2Web**: Tasks on real-world websites.
- **GAIA**: Challenging reasoning and multi-hop QA.
- **OSWorld-Ubuntu**: Desktop environment tasks (requires custom OS environments).
- **hud-samples**: Introductory examples to get started.
## MCP Telemetry with Tasks

When using MCP-enabled agents, HUD automatically traces tool calls made during task execution if they are wrapped in `hud.trace()`:
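A sketch of the pattern, assuming `hud.trace()` is a context manager that takes a run name (both assumptions beyond what this guide states):

```python
import hud
from hud import gym  # assumed import path

# Tool calls made inside the trace are captured automatically.
with hud.trace("pricing-task-run"):  # run-name argument is an assumption
    env = gym.make(task)   # 'task' defined as in Task Structure above
    result = env.run(agent)
```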
## Best Practices for Task Design

- **Clear Prompts**: Ensure the agent understands the goal and the success criteria.
- **Atomic Tasks**: Break complex goals down into smaller, testable tasks.
- **Robust Setup**: Create consistent starting states.
- **Comprehensive Evaluation**: Use multiple evaluation functions to validate success.
- **Iterate**: Test and refine tasks, especially the evaluation logic.
## Advanced Patterns

### Environment-Specific Evaluation
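One way to express this pattern (a hypothetical helper, not a HUD API): pick evaluation criteria based on the environment a task targets.

```python
# Hypothetical helper: choose an evaluate spec based on the target
# environment. The tuple format mirrors the sketches in this guide.
def make_eval(gym_name: str, expected: str) -> tuple[str, str]:
    if gym_name == "hud-browser":
        # Browser tasks can inspect the rendered page directly.
        return ("page_contains", expected)
    # Other environments fall back to checking the agent's response.
    return ("response_includes", expected)


print(make_eval("hud-browser", "Pro plan"))  # → ('page_contains', 'Pro plan')
```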
### Dynamic Task Generation
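A sketch of generating a family of similar tasks programmatically. Plain dicts stand in for the real `Task` class, and the field names follow the assumed format used earlier in this guide.

```python
# Generate one task per target site. Dicts stand in for Task objects
# purely for illustration.
def generate_contact_tasks(sites: list[str]) -> list[dict]:
    return [
        {
            "prompt": f"Find the contact email address on {site}.",
            "gym": "hud-browser",
            "setup": ("goto", site),
            "evaluate": ("response_includes", "@"),
        }
        for site in sites
    ]


tasks = generate_contact_tasks(["https://example.com", "https://example.org"])
print(len(tasks))  # → 2
```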
## Related Guides

- **Browser Environment**: Detailed guide to `hud-browser`, including all of its setup and evaluation functions.
- **Environment Creation & Contribution**: For tasks requiring specialized Docker-based environments.
- **Quickstart**: Introductory examples and common usage patterns.
- **API Reference**: Comprehensive details for all SDK modules and classes.