HUD Documentation — Evaluations and RL Environments.

Build custom agents that interact with MCP tools to complete tasks. An agent is essentially a loop that calls your LLM, executes tools based on its decisions, and continues until the task is complete.

How Agents Work

An agent follows this lifecycle: The agent keeps calling your LLM and executing tools until the LLM stops requesting tools, indicating the task is complete.

The Four Required Methods

To create an agent, you implement four methods that bridge your LLM with MCP’s tool system:

from hud.agents import MCPAgent
from hud.types import AgentResponse, MCPToolCall, MCPToolResult

class MyAgent(MCPAgent):
    """Your custom agent implementation."""

    async def get_system_messages(self) -> list[Any]:
        """1. Called ONCE at start - returns your LLM's system prompt."""
        pass

    async def get_response(self, messages: list[Any]) -> AgentResponse:
        """2. Called EACH TURN - sends messages to your LLM, returns its response, optionally adds the assistant message to messages."""
        pass

    async def format_blocks(self, blocks: list[ContentBlock]) -> list[Any]:
        """3. Called at START - converts initial prompt/context to your LLM format."""
        pass

    async def format_tool_results(
        self, tool_calls: list[MCPToolCall],
        tool_results: list[MCPToolResult]
    ) -> list[Any]:
        """4. Called AFTER TOOLS - converts tool results to your LLM format."""
        pass

Understanding When Each Method is Called

The agent loop calls your methods in this sequence:

get_system_messages() - Once at start
format_blocks() - Converts initial task prompt
get_response() - Gets LLM decision, adds assistant message to messages
format_tool_results() - After each tool execution
Back to step 3 until done

What MCPAgent Does For You

The Agent Loop

The base MCPAgent class handles the entire execution loop. When you call agent.run(task):

Initialization Phase
- Connects to MCP servers (auto-creates client from task.mcp_config if needed)
- Discovers available tools from all connected servers
- Applies tool filtering (allowed/disallowed lists)
- Identifies lifecycle tools (setup, evaluate, response)
Setup Phase (if task.setup_tool provided)
- Executes setup tools (e.g., navigate to website, initialize environment)
- Optionally appends setup output to initial context (controlled by append_setup_output)
- Can include initial screenshots (controlled by initial_screenshot)

Main Execution Loop

while not done and step < max_steps:
    # Your get_response() is called here
    response = await agent.get_response(messages)

    if response.tool_calls:
        # MCPAgent executes tools for you
        results = await agent.call_tools(response.tool_calls)

        # Your format_tool_results() is called here
        messages.extend(await agent.format_tool_results(tool_calls, results))
    else:
        done = True

Evaluation Phase (if task.evaluate_tool provided)
- Runs evaluation tools to calculate reward
- Extracts reward from result (looks for “reward”, “grade”, “score” keys)
- Returns Trace object with full execution history

Tool Management

Tool Discovery & Filtering

agent = ClaudeAgent(
    allowed_tools=["anthropic_computer"],  # Only these tools
    disallowed_tools=["openai_computer"],  # Never these tools
)

Available Tools: Retrieved via self.get_available_tools() - already filtered
Lifecycle Tools: Automatically detected and hidden from your LLM
Response Tools: Auto-detected (tools with “response” in name) for task completion

Client Management

MCPAgent handles complex client lifecycle:

# Option 1: Provide your own client
from hud.clients import MCPClient
client = MCPClient(mcp_config={...})
agent = MyAgent(mcp_client=client)

# Option 2: Auto-create from task
task = Task(mcp_config={...})
agent = MyAgent()  # No client needed
await agent.run(task)  # Client created automatically

Auto-cleanup: Clients created automatically are properly shut down after execution.

Error Handling

MCPAgent provides robust error handling:

Connection Errors: Helpful messages about MCP server availability
Tool Errors: Captured and returned as MCPToolResult with isError=True
Timeout Handling: Graceful shutdown on tool execution timeouts
Trace Always Returns: Even on errors, you get a Trace object with details

Message Accumulation

Messages build up over the conversation:

[System] → [User Prompt] → [LLM Response] → [Tool Results] → [LLM Response] → ...

Your get_response() receives the full conversation history each time, allowing your LLM to maintain context.

Advanced Features

Response Agent Integration

from hud.agents.misc import ResponseAgent

agent = MyAgent(
    response_agent=ResponseAgent()  # Auto-decides when to stop/continue
)

The ResponseAgent can analyze ambiguous LLM responses like “Should I submit?” and decide whether to continue. Telemetry & Tracing

agent = MyAgent(
    auto_trace=True,  # Automatic span creation
    verbose=True  # Detailed logging
)

System Prompt Augmentation

task = Task(
    system_prompt="Additional instructions...",  # Appended to agent's system prompt
    ...
)

Testing Your Agent

Test your agent on a simple task:

import asyncio
import hud
import os
from hud.datasets import Task

async def test_agent():
    with hud.trace("test-custom-agent"):
        task = Task(
            prompt="Navigate to example.com",
            mcp_config={
                "hud": {
                    "url": "https://mcp.hud.so/v3/mcp",
                    "headers": {
                        "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                        "Mcp-Image": "hudpython/hud-remote-browser:latest"
                    }
                }
            },
            setup_tool={
                "name": "setup",
                "arguments": {
                    "name": "navigate",
                    "arguments": {"url": "https://example.com"}
                }
            },
            evaluate_tool={
                "name": "evaluate",
                "arguments": {
                    "name": "url_match",
                    "arguments": {"pattern": "example.com"}
                }
            }
        )
        
        # Use your custom agent
        agent = MyAgent()
        result = await agent.run(task)
        print(f"Reward: {result.reward}")

asyncio.run(test_agent())

Built-in Agents

HUD provides built-in agents for common LLM providers:

from hud.agents import ClaudeAgent, OperatorAgent

# Claude (Anthropic)
claude_agent = ClaudeAgent(
    model="claude-sonnet-4-20250514",
)

# Operator (OpenAI-based)
operator_agent = OperatorAgent()

Always test your agent with the actual MCP servers you’ll use in production.

Next Steps

Benchmarks

Create, run, and publish evaluations

Agents Reference

API details and built-in agents

Get Started

Ideas

Environments

RL

Agents

CLI Reference

SDK Reference

Create Agents

How Agents Work

The Four Required Methods

Understanding When Each Method is Called

What MCPAgent Does For You

The Agent Loop

Tool Management

Client Management

Error Handling

Message Accumulation

Advanced Features

Testing Your Agent

Built-in Agents

Next Steps

Benchmarks

Agents Reference

See Also

Get Started

Ideas

Environments

RL

Agents

CLI Reference

SDK Reference

​How Agents Work

​The Four Required Methods

​Understanding When Each Method is Called

​What MCPAgent Does For You

​The Agent Loop

​Tool Management

​Client Management

​Error Handling

​Message Accumulation

​Advanced Features

​Testing Your Agent

​Built-in Agents

​Next Steps

Benchmarks

Agents Reference

​See Also

How Agents Work

The Four Required Methods

Understanding When Each Method is Called

What MCPAgent Does For You

The Agent Loop

Tool Management

Client Management

Error Handling

Message Accumulation

Advanced Features

Testing Your Agent

Built-in Agents

Next Steps

See Also