Tasks and TaskSets
Defining goals, setup, and evaluation scenarios with Tasks and TaskSets
Tasks define what an Agent should do in an Environment, including the goal, initial setup steps, and how to evaluate success. TaskSets are collections of related tasks.
Task
A Task object provides the configuration for a specific scenario.
Key Attributes
- prompt (str): The primary instruction given to the agent.
- gym (str | CustomGym | None): Specifies the type of Environment needed. Used by hud.gym.make().
- setup (FunctionConfigs | None): Defines actions executed before the agent starts. See Setup Configuration.
- evaluate (FunctionConfigs | None): Defines how to check if the agent succeeded after interaction. See Evaluation Configuration.
- id (str | None): Optional identifier.
- metadata (dict | None): Optional dictionary for extra information.
- config (dict | None): Optional dictionary, primarily for remote execution.
Creating a Task
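A minimal sketch of an interactive task, assuming Task can be imported from hud.task (the exact import path may differ in your SDK version); the setup and evaluate functions shown are the ones used as examples on this page:

```python
from hud.task import Task  # import path assumed; check your SDK version

task = Task(
    prompt="Go to google.com and search for the weather in Paris.",
    gym="hud-browser",
    # Runs before the agent starts: open the starting page
    setup=("goto", "https://google.com"),
    # Runs after the agent finishes: check the page for expected text
    evaluate=("contains_text", "Paris"),
)
```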
Configuration Styles (setup and evaluate)
Both setup and evaluate accept configurations defining function calls within the environment’s controller, using flexible formats (FunctionConfigs):
- String: "browser.maximize"
- Tuple: ("goto", "https://google.com")
- Dictionary: {"function": "wait_for_element", "args": ["#submit"]}
- List: [("goto", "page1"), ("click", "#next")] (executed sequentially)
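As a sketch of how these formats combine in practice (reusing the assumed Task import from above), a list runs its steps in order while a single tuple expresses one call:

```python
from hud.task import Task  # import path assumed, as above

form_task = Task(
    prompt="Click through to the next page and submit the form.",
    gym="hud-browser",
    setup=[("goto", "page1"), ("click", "#next")],  # list form: executed sequentially
    evaluate=("contains_text", "Success!"),         # tuple form: function + argument
)
```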
Setup Configuration (setup)
- Purpose: Establishes a consistent starting state before the agent interacts.
- Execution: Automatically run by hud.gym.make(task). Can be run manually via env._setup().
- Examples: Navigating to a URL, logging in, preparing files.
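For instance (a sketch assuming the async usage shown elsewhere on this page; whether make must be awaited may depend on your SDK version):

```python
import hud

# Creating the environment from a task runs that task's setup steps automatically
env = await hud.gym.make(task)
# (env._setup() can be called manually if the setup steps need to be re-run)
```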
Evaluation Configuration (evaluate)
- Purpose: Determines task success after the agent finishes.
- Execution: Triggered by await env.evaluate().
- Result: The return value of env.evaluate(), often a reward score (e.g., 1.0 or 0.0). This is stored in the reward field of the Trajectory if linked to a Job.
- Examples:
  - Interactive: ("contains_text", "Success!"), ("file_exists", "/path/to/output.txt"). These typically call functions within the active environment controller.
  - QA: ("response_includes", "Paris"). These functions often check the text stored in env.final_response (which comes from the agent’s ResponseAction).
- Note: Check specific environment or evaluation service documentation for available functions.
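A brief sketch of the evaluation step, continuing from the environment created above (the exact shape of the returned value depends on the evaluation function):

```python
# After the agent has finished interacting with the environment:
result = await env.evaluate()  # runs the task's evaluate configuration
print(result)                  # often a reward score such as 1.0 or 0.0
```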
TaskSet
A TaskSet is a list of related Task objects, useful for benchmarks.
Key Attributes
- tasks (list[Task]): The list of tasks.
- id (str | None): Optional identifier.
- description (str | None): Optional description.
Loading a TaskSet
Load predefined sets from the HUD platform:
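For example (a sketch assuming load_taskset is exposed at the top level of the hud package and is awaited; the set name is illustrative):

```python
from hud import load_taskset  # import path assumed

taskset = await load_taskset("GAIA")  # name shown for illustration
print(len(taskset.tasks), "tasks loaded")
```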
Currently supported TaskSets available via load_taskset include OSWorld, GAIA, and WebVoyager subsets.
Creating a TaskSet Manually
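A minimal sketch, assuming the TaskSet and Task import paths shown earlier:

```python
from hud.task import Task
from hud.taskset import TaskSet  # import paths assumed

my_taskset = TaskSet(
    id="browser-smoke-tests",
    description="A small custom set of browser tasks",
    tasks=[
        Task(
            prompt="Open example.com and confirm the page heading is visible.",
            gym="hud-browser",
            setup=("goto", "https://example.com"),
            evaluate=("contains_text", "Example Domain"),
        ),
        Task(
            prompt="What city is the Eiffel Tower in?",
            gym="hud-browser",
            evaluate=("response_includes", "Paris"),
        ),
    ],
)
```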
Related Concepts
- Environment: Where Tasks are executed and evaluated.
- Agent: Aims to complete the Task prompt.
- Job: Groups runs of different Tasks.
- Trajectory: Records the execution of a Task.
Defining Question-Answering (QA) Tasks
While HUD excels at interactive tasks, you can also define tasks that are primarily question-answering. The key differences are:
- gym: You might still use an existing environment type like "hud-browser" if you want the QA to happen within that context (e.g., asking the agent to answer based on a webpage). For pure QA without environment interaction, a specific "qa" gym type might be introduced in the future, but currently you’d use an existing type.
- prompt: Contains the question for the agent.
- setup: Often minimal or unnecessary for pure QA.
- evaluate: Defines how to check the agent’s final text answer. This typically involves calling a specific evaluation function that compares the agent’s final submitted response (see ResponseAction in CLA Details) against expected criteria. The env.final_response attribute holds the text submitted by the agent via ResponseAction.
- target: (Recommended) Store the ground truth answer in the metadata or potentially a dedicated target field for clarity during evaluation function design.
The Agent handling such a task should recognize it doesn’t need complex interaction and output a ResponseAction containing the final answer. The env.evaluate() call then triggers the specified check (like response_includes) against the stored response.
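Putting this together, a hedged sketch of a QA task (reusing the assumed Task import from above; keeping the ground truth in metadata follows the recommendation in the list):

```python
qa_task = Task(
    prompt="What is the capital of France?",
    gym="hud-browser",  # an existing gym type is reused; no dedicated "qa" gym yet
    evaluate=("response_includes", "Paris"),
    metadata={"target": "Paris"},  # ground-truth answer stored for reference
)

# The agent submits its answer via a ResponseAction; env.evaluate() then checks
# env.final_response against the "response_includes" criterion.
```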