Defining goals, setup, and evaluation scenarios with Tasks and TaskSets
Tasks define what an Agent should do in an Environment, including the goal, initial setup steps, and how to evaluate success. TaskSets are collections of related tasks.
A `Task` object provides the configuration for a specific scenario. Its fields are:

- `prompt` (str): The primary instruction given to the agent.
- `gym` (str | CustomGym | None): Specifies the type of Environment needed. Used by `hud.gym.make()`.
- `setup` (FunctionConfigs | None): Defines actions executed before the agent starts. See Setup Configuration below.
- `evaluate` (FunctionConfigs | None): Defines how to check whether the agent succeeded after the interaction. See Evaluation Configuration below.
- `id` (str | None): Optional identifier.
- `metadata` (dict | None): Optional dictionary for extra information.
- `config` (dict | None): Optional dictionary, primarily for remote execution.
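For example, a minimal task might look like the sketch below. The import path `hud.task` and the keyword-argument constructor are assumptions to check against your SDK version:

```python
from hud.task import Task  # assumed import path; adjust to your SDK version

task = Task(
    prompt="Find the current weather in Berlin and report it.",
    gym="hud-browser",
    id="weather-berlin-01",
    metadata={"category": "demo"},
)
```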
Both `setup` and `evaluate` accept configurations that define function calls on the environment's controller, using any of these flexible formats (`FunctionConfigs`):
"browser.maximize"
("goto", "https://google.com")
{"function": "wait_for_element", "args": ["#submit"]}
[("goto", "page1"), ("click", "#next")]
(Executed sequentially)setup
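As a sketch, a single task can mix these formats freely (the `goto`, `wait_for_element`, and `contains_text` function names come from the examples on this page; the import path is an assumption):

```python
from hud.task import Task  # assumed import path

task = Task(
    prompt="Log in and confirm the dashboard loads.",
    gym="hud-browser",
    # A list of configs is executed sequentially before the agent starts.
    setup=[
        ("goto", "https://example.com/login"),
        {"function": "wait_for_element", "args": ["#username"]},
    ],
    # A single tuple config used as the success check.
    evaluate=("contains_text", "Dashboard"),
)
```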
Setup Configuration (`setup`): Setup steps are executed automatically by `hud.gym.make(task)`. They can also be run manually via `env._setup()`.
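If you ever need to re-run setup by hand, a call like the sketch below should correspond to the manual path just described; `_setup` is a private helper, and treating it (and `gym.make`) as awaitable is an assumption based on the async `env.evaluate()`:

```python
import asyncio
from hud import gym          # assumed import; adjust to your SDK version
from hud.task import Task    # assumed import path

async def main() -> None:
    task = Task(
        prompt="Open the login page.",
        gym="hud-browser",
        setup=("goto", "https://example.com/login"),
    )
    env = await gym.make(task)   # setup steps run automatically here
    await env._setup()           # re-run the same setup steps manually (assumed awaitable)

asyncio.run(main())
```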
Evaluation Configuration (`evaluate`): Evaluation checks run when you call `await env.evaluate()`. The result of `env.evaluate()` is often a reward score (e.g., `1.0` or `0.0`), and it is stored in the `reward` field of the Trajectory if the run is linked to a Job.

Typical evaluation configs fall into two groups. Environment-state checks such as `("contains_text", "Success!")` or `("file_exists", "/path/to/output.txt")` typically call functions within the active environment controller. Response checks such as `("response_includes", "Paris")` often check the text stored in `env.final_response` (which comes from the agent's `ResponseAction`).
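Putting the pieces together, a run-and-evaluate loop might look roughly like this. The agent loop is elided; `gym.make` being awaitable and `env.close()` existing are assumptions, while `env.evaluate()` and the reward semantics come from this page:

```python
import asyncio
from hud import gym          # assumed import; adjust to your SDK version
from hud.task import Task    # assumed import path

async def main() -> None:
    task = Task(
        prompt="Search for the capital of France and submit the answer.",
        gym="hud-browser",
        setup=("goto", "https://www.google.com"),
        evaluate=("response_includes", "Paris"),
    )

    env = await gym.make(task)     # setup steps run here
    # ... run your agent against env until it submits a ResponseAction ...
    reward = await env.evaluate()  # e.g. 1.0 or 0.0
    print("reward:", reward)
    await env.close()              # assumed cleanup helper

asyncio.run(main())
```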
A `TaskSet` is a list of related `Task` objects, useful for benchmarks. Its fields are:

- `tasks` (list[Task]): The list of tasks.
- `id` (str | None): Optional identifier.
- `description` (str | None): Optional description.

Predefined TaskSets can be loaded from the HUD platform with `load_taskset`.
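A minimal sketch, assuming `load_taskset` is exposed at the top level of the `hud` package and is awaitable:

```python
import asyncio
from hud import load_taskset  # assumed import; adjust to your SDK version

async def main() -> None:
    taskset = await load_taskset("GAIA")   # one of the predefined sets listed below
    print(f"Loaded {len(taskset.tasks)} tasks: {taskset.description}")

asyncio.run(main())
```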
The TaskSets currently available via `load_taskset` include OSWorld, GAIA, and WebVoyager subsets.
While HUD excels at interactive tasks, you can also define tasks that are primarily question-answering, with the question asked directly in the `prompt`. The key differences are:
- `gym`: You might still use an existing environment type like `"hud-browser"` if you want the QA to happen within that context (e.g., asking the agent to answer based on a webpage). For pure QA without environment interaction, a dedicated `"qa"` gym type may be introduced in the future, but for now you would use an existing type.
- `prompt`: Contains the question for the agent.
- `setup`: Often minimal or unnecessary for pure QA.
- `evaluate`: Defines how to check the agent's final text answer. This typically calls an evaluation function that compares the agent's final submitted response (see `ResponseAction` in CLA Details) against the expected criteria. The `env.final_response` attribute holds the text submitted by the agent via `ResponseAction`.
- `target`: (Recommended) Store the ground-truth answer in `metadata`, or potentially a dedicated `target` field, for clarity when designing the evaluation function.

An agent handling such a task should recognize that it doesn't need complex interaction and output a `ResponseAction` containing the final answer. The `env.evaluate()` call then triggers the specified check (such as `response_includes`) against the stored response.
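Putting this together, a pure question-answering task might look like the sketch below. The field values are illustrative, and the import path plus the exact `response_includes` argument shape are assumptions to verify against your SDK version:

```python
from hud.task import Task  # assumed import path

qa_task = Task(
    prompt="What is the capital of France?",
    gym="hud-browser",                        # reuse an existing gym type for now
    evaluate=("response_includes", "Paris"),  # compared against env.final_response
    metadata={"target": "Paris"},             # ground-truth answer stored for reference
)
```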