Task
`Task` class for defining agent objectives and dataset utilities for managing task collections.
Task Class
Field | Type | Description | Default |
---|---|---|---|
id | str \| None | Unique identifier (UUID recommended) | None |
prompt | str | Task instruction for the agent | Required |
mcp_config | dict[str, Any] | MCP server configuration | Required |
setup_tool | MCPToolCall \| list[MCPToolCall] \| None | Tool(s) to prepare the environment | None |
evaluate_tool | MCPToolCall \| list[MCPToolCall] \| None | Tool(s) to score performance | None |
agent_config | dict[str, Any] \| None | Agent configuration (system_prompt, allowed_tools, etc.) | None |
metadata | dict[str, Any] | Extra task metadata | {} |
Environment Variable Substitution
The `mcp_config` field automatically resolves environment variables using `${VAR_NAME}` syntax. Substitution uses `Template.substitute()` with a `defaultdict` that returns empty strings for missing variables, so an unset variable resolves to an empty string instead of raising.
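For illustration, a minimal sketch of the behavior; the import path and the endpoint URL are assumptions:

```python
import os
from hud.datasets import Task  # import path assumed

os.environ["HUD_API_KEY"] = "sk-test-123"

task = Task(
    prompt="Open the dashboard and report the error count.",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",  # placeholder endpoint
            "headers": {"Authorization": "Bearer ${HUD_API_KEY}"},
        }
    },
)

# The template is resolved at construction; a missing variable would
# substitute an empty string rather than raise.
print(task.mcp_config["hud"]["headers"]["Authorization"])  # Bearer sk-test-123
```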
Field Validators
Task automatically:
- Parses JSON strings - `mcp_config` and `metadata` can be JSON strings
- Converts dicts to MCPToolCall - `setup_tool` and `evaluate_tool` dicts are converted
- Resolves environment variables - only when created from a dict (templates are preserved in `model_dump()`)
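A sketch of the validators in action, assuming the same import path as above; the Docker image name is illustrative:

```python
import json
from hud.datasets import Task  # import path assumed

# mcp_config arrives as a JSON string and setup_tool as a plain dict,
# as they would when loaded from a HuggingFace dataset row.
task = Task(
    prompt="Reach 128 in 2048.",
    mcp_config=json.dumps(
        {"local": {"command": "docker", "args": ["run", "-i", "hud-text-2048"]}}
    ),
    setup_tool={"name": "setup", "arguments": {"board_size": 4}},
)

print(type(task.mcp_config))  # dict - the JSON string was parsed
print(type(task.setup_tool))  # MCPToolCall - the dict was converted
```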
Recommended Evaluation Workflow
When developing and testing agents, follow this progression for optimal debugging and performance.
Step 1: Single Task Development
Start with individual tasks to debug your agent and environment setup (see the sketch after this list):
- Full error stack traces
- Clear log output
- Quick iteration cycle
- Easy to debug environment issues
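A hedged sketch of the single-task loop; the agent class, `hud.trace`, and the `run()` signature are assumptions:

```python
import asyncio

import hud
from hud.agents import ClaudeAgent  # agent class assumed
from hud.datasets import Task

async def main() -> None:
    task = Task(
        prompt="Reach 64 in 2048.",
        mcp_config={"local": {"command": "docker", "args": ["run", "-i", "hud-text-2048"]}},
        evaluate_tool={"name": "evaluate", "arguments": {"target": 64}},
    )
    with hud.trace("single-task-debug"):  # one trace, full stack traces on failure
        agent = ClaudeAgent()
        result = await agent.run(task, max_steps=10)
        print(result.reward)

asyncio.run(main())
```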
Step 2: Full Dataset Evaluation
Once single tasks work reliably, scale up to full dataset evaluation:
- See all agent logs in the CLI in real time
- Catch intermittent environment issues
- Validate agent behavior at scale
- Efficient asyncio-based concurrency
Concurrency tuning (see the sketch after this list):
- Start with `max_concurrent=50` and adjust based on results
- Increase to 100-200 for faster evaluation (if API limits allow)
- Decrease to 10-20 if hitting rate limits
- Most datasets work well with 50-100 concurrent tasks
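For example, a sketch using the `run_dataset` parameters documented below; the dataset ID, agent class, and agent kwargs are placeholders:

```python
import asyncio

from hud.agents import ClaudeAgent  # agent class assumed
from hud.datasets import run_dataset  # import path assumed

async def main() -> None:
    results = await run_dataset(
        name="2048-eval",
        dataset="hud-evals/2048-taskset",  # placeholder dataset ID
        agent_class=ClaudeAgent,
        agent_config={"model": "claude-sonnet-4-20250514"},  # illustrative kwargs
        max_concurrent=50,  # start here, then tune per the guidance above
        max_steps=10,
    )
    print(f"Completed {len(results)} tasks")  # results come back in dataset order

asyncio.run(main())
```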
Quick Reference
Stage | Method | Concurrency | Use Case | Debugging |
---|---|---|---|---|
Development | Single task | 1 | Initial debugging | Excellent |
Production | run_dataset | 50-200 | Full evaluation | Good |
Dataset Functions
run_dataset
Parameter | Type | Description | Default |
---|---|---|---|
name | str | Job name for tracking | Required |
dataset | str \| Dataset \| list[dict] | HF dataset ID, Dataset object, or task dicts | Required |
agent_class | type[MCPAgent] | Agent class to instantiate | Required |
agent_config | dict[str, Any] \| None | Constructor kwargs for agent | None |
max_concurrent | int | Maximum concurrent tasks (1-200 recommended) | 30 |
metadata | dict[str, Any] \| None | Job metadata | None |
max_steps | int | Max steps per task | 10 |
split | str | Dataset split when loading by ID | "train" |
auto_respond | bool | Use ResponseAgent for continuations | False |
Returns: `list[Any]` - Results in dataset order
Features:
- Automatic job tracking with `hud.async_job()`
- Per-task trace tracking with `hud.async_trace()`
- Efficient asyncio-based concurrency with semaphores
- Automatic telemetry flushing
- Progress tracking with individual traces
- Graceful error handling (continues on failures)
fetch_system_prompt_from_dataset
Fetches `system_prompt.txt` from a HuggingFace dataset repository.
Returns: `str | None` - System prompt text if found
Note: Requires `huggingface_hub` to be installed.
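A minimal usage sketch; the import path, a synchronous signature, and the repo ID are assumptions:

```python
from hud.datasets import fetch_system_prompt_from_dataset  # import path assumed

# Returns None when the repository has no system_prompt.txt.
prompt = fetch_system_prompt_from_dataset("hud-evals/sheet-bench-50")  # placeholder ID
if prompt is not None:
    print(prompt[:80])
```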
save_tasks
Parameter | Type | Description | Default |
---|---|---|---|
tasks | list[dict[str, Any]] | Task dictionaries (NOT Task objects) | Required |
repo_id | str | HuggingFace repository ID | Required |
**kwargs | Any | Additional args for push_to_hub() | - |
save_tasks serializes the following fields to JSON strings before upload:
- `mcp_config` → JSON string
- `setup_tool` → JSON string (if present)
- `evaluate_tool` → JSON string (if present)
- `metadata` → JSON string (if present)
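For instance, a hedged sketch; the import path and repo ID are assumptions, and `private=True` is simply one `push_to_hub()` kwarg:

```python
from hud.datasets import save_tasks  # import path assumed

tasks = [
    {
        "prompt": "Reach 256 in 2048.",
        "mcp_config": {"hud": {"headers": {"Authorization": "Bearer ${HUD_API_KEY}"}}},
        "evaluate_tool": {"name": "evaluate", "arguments": {"target": 256}},
    },
]

# Plain dicts keep the ${HUD_API_KEY} template intact; extra kwargs
# are forwarded to push_to_hub().
save_tasks(tasks, repo_id="my-org/my-taskset", private=True)
```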
MCPToolCall Type
Field | Type | Description | Default |
---|---|---|---|
name | str | Tool name to call | Required |
arguments | dict[str, Any] | Tool arguments | {} |
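For example, the two tool-call shapes a task typically carries; the import path is an assumption:

```python
from hud.types import MCPToolCall  # import path assumed

setup = MCPToolCall(name="setup", arguments={"board_size": 4})
evaluate = MCPToolCall(name="evaluate", arguments={"target": 2048})
```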
Real-World Examples
Loading Tasks from Datasets
From `examples/run_evaluation.py`:
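The file is not reproduced here; a hedged sketch of the pattern it follows, with the dataset ID as a placeholder:

```python
from datasets import load_dataset

from hud.datasets import Task  # import path assumed

# Load a task collection from the Hub and rehydrate each row into a Task;
# the field validators parse JSON-string columns automatically.
dataset = load_dataset("hud-evals/browser-taskset", split="train")  # placeholder ID
tasks = [Task(**row) for row in dataset]
print(f"Loaded {len(tasks)} tasks")
```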
Task Structure in Datasets
From `environments/text_2048/2048_taskconfigs.json`:
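The file is not reproduced here; each entry follows the Task schema above, roughly like this (values are illustrative, shown as a Python dict mirroring the JSON):

```python
task_config = {
    "id": "2048-target-128",
    "prompt": "Play 2048 until you create the 128 tile.",
    "mcp_config": {"local": {"command": "docker", "args": ["run", "-i", "hud-text-2048"]}},
    "setup_tool": {"name": "setup", "arguments": {"board_size": 4}},
    "evaluate_tool": {"name": "evaluate", "arguments": {"target": 128}},
    "metadata": {"difficulty": "easy", "game": "2048"},
}
```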
Creating and Saving Tasks
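A sketch of the round trip, saving dictionaries so env var templates survive; the endpoint URL and repo ID are placeholders:

```python
import uuid

from hud.datasets import save_tasks  # import path assumed

task_dicts = [
    {
        "id": str(uuid.uuid4()),  # UUIDs are recommended for HF datasets
        "prompt": f"Reach {target} in 2048.",
        "mcp_config": {
            "hud": {
                "url": "https://mcp.hud.so/v3/mcp",  # placeholder endpoint
                "headers": {"Authorization": "Bearer ${HUD_API_KEY}"},  # template preserved
            }
        },
        "evaluate_tool": {"name": "evaluate", "arguments": {"target": target}},
        "metadata": {"difficulty": "easy" if target <= 128 else "hard"},
    }
    for target in (64, 128, 256)
]

save_tasks(task_dicts, repo_id="my-org/2048-taskset")  # placeholder repo
```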
Agent Integration
Tasks automatically configure agents:
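A hedged sketch; the agent class and `run()` signature are assumptions:

```python
import asyncio

from hud.agents import ClaudeAgent  # agent class assumed
from hud.datasets import Task

task = Task(
    prompt="Fill in the spreadsheet totals.",
    mcp_config={"hud": {"url": "https://mcp.hud.so/v3/mcp"}},  # placeholder endpoint
    agent_config={
        "system_prompt": "Prefer keyboard shortcuts over mouse actions.",
        "allowed_tools": ["computer", "evaluate"],
    },
)

async def main() -> None:
    # agent_config is applied when the agent picks up the task: the system
    # prompt is appended and the tool list is filtered (see options below).
    agent = ClaudeAgent()
    result = await agent.run(task)
    print(result.reward)

asyncio.run(main())
```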
Agent Config Options
The `agent_config` field supports the following options:
Option | Type | Description |
---|---|---|
system_prompt | str | Custom system prompt appended to agent’s default |
allowed_tools | list[str] | Tools the agent can use (replaces agent_tools) |
disallowed_tools | list[str] | Tools to exclude from the agent |
append_setup_output | bool | Include setup output in first message (default: True) |
initial_screenshot | bool | Take screenshot before first action (default: True) |
Best Practices
- Use UUIDs for task IDs - Required for HuggingFace datasets
- Save dictionaries, not objects - Preserves env var templates
- Use agent_config for agent settings - Centralize agent configuration in one place
- Use metadata for filtering - Category, difficulty, tags
- Test locally first - Before uploading to HuggingFace
- Version your datasets - Use meaningful repo names
Common Patterns
Filtering Tasks
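One way to filter by the `metadata` column, which is stored as a JSON string in saved datasets; the dataset ID is a placeholder:

```python
import json

from datasets import load_dataset

dataset = load_dataset("my-org/my-taskset", split="train")  # placeholder ID

# metadata is a JSON-string column; decode it per row to filter.
easy = dataset.filter(
    lambda row: json.loads(row.get("metadata") or "{}").get("difficulty") == "easy"
)
print(f"{len(easy)} easy tasks")
```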
Custom System Prompts
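One pattern, combining `fetch_system_prompt_from_dataset` with `agent_config`; IDs and the fallback prompt are illustrative:

```python
from hud.datasets import Task, fetch_system_prompt_from_dataset  # import path assumed

system_prompt = fetch_system_prompt_from_dataset("my-org/my-taskset")  # placeholder ID

task = Task(
    prompt="Sort the inbox by sender.",
    mcp_config={"hud": {"url": "https://mcp.hud.so/v3/mcp"}},  # placeholder endpoint
    agent_config={
        # Appended to the agent's default system prompt (see the table above).
        "system_prompt": system_prompt or "Work quickly and verify each step.",
    },
)
```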
Environment Variable Management
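A sketch using python-dotenv; the `.env` workflow itself is an assumption, not an SDK requirement:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # pull HUD_API_KEY and friends from a local .env file

# Tasks created from dicts resolve ${HUD_API_KEY} at construction time,
# so verify it is set before instantiating tasks.
assert os.getenv("HUD_API_KEY"), "HUD_API_KEY must be set"
```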
See Also
- Task System - Conceptual overview
- Benchmarks - Building and running datasets
- Agents - How agents use tasks