# Adapting Software
Source: https://docs.hud.so/build-environments/docker-to-mcp
Patterns for wrapping existing applications using HUD SDK tools
Learn how to adapt existing software into MCP environments using patterns from the HUD SDK. All examples link to actual implementations in [hud-evals/hud-python](https://github.com/hud-evals/hud-python).
## Browser Automation Pattern
The HUD SDK provides powerful computer tools that can control any browser-based application. See how [remote\_browser](https://github.com/hud-evals/hud-python/tree/main/environments/remote_browser) implements this pattern.
### Using HUD Computer Tools
```python
from hud.tools.computer import HudComputerTool
from hud.tools.executors.base import BaseExecutor
# Browser executor wraps Playwright
class BrowserExecutor(BaseExecutor):
    def __init__(self, playwright_tool):
        self.browser = playwright_tool

    async def screenshot(self) -> str:
        # Return base64 screenshot from browser
        return await self.browser.screenshot()

    async def click(self, x: int, y: int):
        # Translate to browser coordinates
        await self.browser.click(x, y)

# Use with computer tool
executor = BrowserExecutor(playwright_tool)
computer_tool = HudComputerTool(executor=executor)
```
See the full implementation:
* [BrowserExecutor](https://github.com/hud-evals/hud-python/blob/main/environments/remote_browser/src/hud_controller/tools/executor.py) - Wraps Playwright with computer tool interface
* [Server integration](https://github.com/hud-evals/hud-python/blob/main/environments/remote_browser/src/hud_controller/server.py#L202-L205) - How tools are registered
### Multiple Provider Support
```python
# Support different cloud browsers
provider = os.getenv("BROWSER_PROVIDER", "anchorbrowser")
browser_provider = get_provider(provider)
# Each provider implements the same interface
cdp_url = await browser_provider.launch()
browser = await playwright.connect_over_cdp(cdp_url)
```
See [providers directory](https://github.com/hud-evals/hud-python/tree/main/environments/remote_browser/src/hud_controller/providers) for implementations.
## Game Environment Pattern
Simple stateful applications like games use custom tools. See [text\_2048](https://github.com/hud-evals/hud-python/tree/main/environments/text_2048) for the pattern.
### Custom Tool Implementation
```python
from hud.tools.base import BaseTool
from mcp.types import ContentBlock, TextContent
class MoveTool(BaseTool):
    """Custom tool for game moves."""

    def __init__(self, env):
        super().__init__(
            name="move",
            description="Move tiles in a direction",
            parameters={
                "type": "object",
                "properties": {
                    "direction": {"type": "string", "enum": ["up", "down", "left", "right"]}
                },
                "required": ["direction"]
            }
        )
        self.env = env

    async def __call__(self, direction: str) -> list[ContentBlock]:
        """Execute the move - this is what BaseTool calls."""
        # Perform the move
        self.env.move(direction)
        # Return visual state
        board_display = self.env.get_board_ascii()
        return [TextContent(text=board_display, type="text")]
```
See [MoveTool implementation](https://github.com/hud-evals/hud-python/blob/main/environments/text_2048/src/hud_controller/tools/move.py).
## Docker Base Images
HUD provides optimized base images for common use cases:
### Browser Automation
```dockerfile
FROM hudpython/browser-base:latest
# Includes Chromium, Playwright, and dependencies
# Add your automation code
COPY src/ /app/src/
CMD ["python", "-m", "hud_controller.server"]
```
### Desktop GUI
```dockerfile
FROM hudpython/novnc-base:latest
# Ubuntu with XFCE, VNC, and noVNC pre-configured
# Install your GUI application
RUN apt-get update && apt-get install -y \
    your-gui-app
# VNC is available on port 5901, noVNC on 8080
```
### Custom Base Images
When building from scratch:
```dockerfile
FROM python:3.11-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
    required-packages \
    && rm -rf /var/lib/apt/lists/*

# Python environment
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1
WORKDIR /app
```
See [base image examples](https://github.com/hud-evals/hud-python/tree/main/environments) in environment Dockerfiles.
## Environment Context Pattern
Share state between tools using environment context:
```python
# Define context class
class BrowserEnvironmentContext:
    def __init__(self, browser):
        self.browser = browser
        self.page = None

    async def goto(self, url: str):
        self.page = await self.browser.new_page()
        await self.page.goto(url)

# Share with tools in server.py
@mcp.initialize
async def initialize_environment(ctx):
    browser = await launch_browser()
    env = BrowserEnvironmentContext(browser)
    # Make env available to all tools
    global environment
    environment = env
```
## Common Return Types
All tools should return standardized types for consistency.
### ContentBlock for Tool Output
```python
from mcp.types import ContentBlock, TextContent, ImageContent
# Text output
return [TextContent(text="Operation complete", type="text")]
# Image output
return [ImageContent(data=base64_data, mimeType="image/png", type="image")]
# Multiple outputs
return [
    TextContent(text="Screenshot captured", type="text"),
    ImageContent(data=screenshot, mimeType="image/png", type="image")
]
```
### EvaluationResult for Scoring
```python
from hud.tools.types import EvaluationResult
return EvaluationResult(
    reward=0.8,   # Score between 0-1
    done=True,    # Task complete?
    info={        # Additional context
        "score": 80,
        "threshold": 75,
        "details": "Form submitted successfully"
    }
)
```
See [types.py](https://github.com/hud-evals/hud-python/blob/main/hud/tools/types.py) for full definitions.
## Tool Discovery and Resources
Expose available functions as MCP resources:
```python
@mcp.resource("setup://functions")
async def list_setup_functions():
    """List all available setup functions."""
    return {
        "functions": [
            {"name": name, "description": func.__doc__}
            for name, func in setup_functions.items()
        ]
    }
```
See [resource examples](https://github.com/hud-evals/hud-python/blob/main/environments/browser/src/hud_controller/server.py#L67-L90).
## Error Handling Patterns
Consistent error handling across tools:
```python
from mcp import McpError
from mcp.types import INTERNAL_ERROR
try:
    result = await risky_operation()
except SpecificError as e:
    # Return error as content
    return [TextContent(text=f"Operation failed: {e}", type="text")]
except Exception as e:
    # Fatal errors
    raise McpError(INTERNAL_ERROR, f"Unexpected error: {e}")
```
## Next Steps
* Study complete implementations
* Learn the hub pattern, telemetry, and metadata
For a step-by-step implementation guide, see the [environments README](https://github.com/hud-evals/hud-python/blob/main/environments/README.md).
# Build Environments
Source: https://docs.hud.so/build-environments/index
Wrap any software in MCP for agent evaluation
Turn any software that runs in Docker into an agent-testable environment. Whether it's a web app, desktop application, game, or API, HUD provides the tools to make it MCP-compatible.
## What Defines a HUD Environment?
A HUD environment is a Docker container that:
1. **Runs an MCP server** via stdio transport
2. **Implements lifecycle tools** (`setup` and `evaluate`)
3. **Exposes interaction tools** for agents to use
4. **Works identically** locally and remotely
The key is the Docker `CMD` entrypoint - it launches an MCP server that can be connected to by any MCP client, enabling both local development and remote scaling.
## Architecture & Lifecycle
```mermaid
graph LR
Agent[Agent] -->|MCP Protocol| Server[MCP Server]
Server --> Setup[Setup Tool]
Server --> Interact[Interaction Tools]
Server --> Evaluate[Evaluate Tool]
Setup -->|Initialize State| Environment
Interact -->|Control| Environment
Evaluate -->|Score Performance| Environment
```
### Lifecycle Flow
1. **Initialize**: MCP handshake establishes connection
2. **Setup**: Prepare environment to known state (hidden from agent)
3. **Interact**: Agent uses tools to complete task
4. **Evaluate**: Score agent's performance (hidden from agent)
## Environment Specification
### Required Components
Every HUD environment must have:
```python
from hud.server import MCPServer

mcp = MCPServer(name="my-environment")

# 1. Setup tool (lifecycle - hidden from agents)
@mcp.tool()
async def setup(**kwargs) -> dict:
    """Initialize environment state"""
    # Reset to known state
    # Load test data
    # Configure settings
    return {"status": "ready"}

# 2. Evaluate tool (lifecycle - hidden from agents)
@mcp.tool()
async def evaluate(**kwargs) -> dict:
    """Score agent performance"""
    # Check task completion
    # Calculate reward (0-1)
    # Determine if done
    return {
        "reward": 0.8,
        "done": True,
        "info": {"details": "..."}
    }

# 3. Interaction tools (visible to agents)
@mcp.tool()
async def perform_action(param: str) -> dict:
    """Tool agents use to interact"""
    # Modify environment state
    # Return feedback
    return {"result": "success"}
```
### Standard File Structure
```
my-environment/
├── Dockerfile # Container definition
├── pyproject.toml # Python package config
├── README.md # Environment documentation
└── src/
└── hud_controller/ # Standard package name
├── __init__.py
├── server.py # MCP server entry point
├── context.py # Environment state (optional)
├── setup/ # Setup functions (if using hub)
├── evaluate/ # Evaluation functions (if using hub)
└── tools/ # Custom interaction tools
```
### Docker Requirements
The Dockerfile must:
* Install dependencies
* Copy source code
* Set proper working directory
* **Define CMD that runs the MCP server**
```dockerfile
FROM python:3.11-slim
WORKDIR /app
# Copy and install
COPY pyproject.toml ./
COPY src/ ./src/
RUN pip install -e .
# Critical: Launch MCP server on stdio
CMD ["python", "-m", "hud_controller.server"]
```
This `CMD` is what enables the same image to work both locally and remotely:
* Locally: `docker run -i` pipes stdio
* Remotely: HUD platform manages stdio transport
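For example, the same image can be referenced from either transport. A minimal sketch (the image name `my-environment:latest` is a placeholder):

```python
import os

# Local development: the MCP client spawns the container and speaks MCP over stdio
local_config = {
    "local": {
        "command": "docker",
        "args": ["run", "--rm", "-i", "my-environment:latest"],
    }
}

# Remote scaling: the HUD platform runs the same image and proxies stdio
remote_config = {
    "hud": {
        "url": "https://mcp.hud.so/v3/mcp",
        "headers": {
            "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
            "Mcp-Image": "my-environment:latest",
        },
    }
}
```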
## MCPServer vs FastMCP
The HUD SDK provides `MCPServer`, which extends FastMCP with Docker-friendly features:
### What MCPServer Adds
1. **SIGTERM Handling**: Graceful shutdown in containers
```python
mcp = MCPServer(name="my-env")
mcp.run() # Automatically handles SIGTERM
```
2. **Initialize Decorator**: Async setup during MCP handshake
```python
@mcp.initialize
async def initialize_environment(ctx):
    # ctx.session: MCP ServerSession
    # ctx.meta: Metadata from client
    # ctx.meta.progressToken: For progress updates
    await setup_resources()
```
3. **Shutdown Decorator**: Cleanup after lifespan
```python
@mcp.shutdown
async def cleanup():
    await close_connections()
```
4. **BaseTool Support**: Enhanced tool registration
```python
from hud.tools.base import BaseTool
custom_tool = MyTool() # Inherits from BaseTool
mcp.add_tool(custom_tool) # Uses tool.mcp automatically
```
## Common MCP Types
### Tool Outputs
All tools should return standardized types:
```python
from mcp.types import ContentBlock, TextContent, ImageContent
# Text output
return [TextContent(text="Operation complete", type="text")]
# Image output
return [ImageContent(
    data=base64_screenshot,
    mimeType="image/png",
    type="image"
)]

# Multiple outputs
return [
    TextContent(text="Status: Ready", type="text"),
    ImageContent(data=screenshot, mimeType="image/png", type="image")
]
```
### Evaluation Results
```python
from hud.tools.types import EvaluationResult
return EvaluationResult(
    reward=0.8,         # Score between 0-1
    done=True,          # Task complete?
    content="Details",  # Optional description
    info={              # Additional metadata
        "score": 80,
        "threshold": 75
    }
)
```
## Quick Example
Here's a minimal environment:
```python
from hud.server import MCPServer
from mcp.types import TextContent
mcp = MCPServer(name="counter-env")
# State
counter = 0
@mcp.tool()
async def setup(start_value: int = 0):
"""Reset counter to starting value"""
global counter
counter = start_value
return {"status": "ready", "counter": counter}
@mcp.tool()
async def increment():
"""Increment the counter"""
global counter
counter += 1
return [TextContent(text=f"Counter: {counter}", type="text")]
@mcp.tool()
async def evaluate(target: int):
"""Check if counter reached target"""
global counter
return {
"reward": 1.0 if counter >= target else counter / target,
"done": counter >= target,
"info": {"counter": counter, "target": target}
}
if __name__ == "__main__":
mcp.run()
```
## Getting Started
* Build your first environment step-by-step
* Patterns for wrapping existing applications
* Hub pattern, telemetry, and production tips
## Reference Implementations
Study these examples:
### text\_2048 (Simple)
* ASCII game with clear state
* Custom move tool
* Simple evaluation logic
### remote\_browser (Complex)
* Multi-provider browser automation
* Hub pattern for dozens of functions
* Production telemetry
## Key Concepts
### Tool Categories
1. **Setup Tools**: Initialize state (hidden from agents)
2. **Evaluate Tools**: Score performance (hidden from agents)
3. **Interaction Tools**: What agents use to control the environment
### MCP Communication
* **stdout**: Reserved for MCP protocol (JSON-RPC)
* **stderr**: All logs and debug output
* **stdio transport**: How containers communicate
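Because stdout carries the protocol itself, any stray print would corrupt the JSON-RPC stream. A minimal sketch of routing all logging to stderr:

```python
import logging
import sys

# Send every log record to stderr so stdout stays reserved for MCP JSON-RPC.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
```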
### Process Separation for Hot-Reload Development
**Critical Design Principle**: For environments that need to maintain state during development (like browsers, databases, or running applications), you must separate the MCP server process from the actual environment process. This separation enables hot-reload development without losing state.
The MCP server should be a lightweight process that:
1. Starts quickly and can restart without side effects
2. Connects to your environment process on initialization
3. Acts as a bridge between MCP protocol and your environment
The environment process should:
1. Run as a separate, persistent service
2. Maintain state across MCP server restarts
3. Accept connections from the MCP server
#### Communication Methods
You can connect the MCP server to your environment using:
* **Unix Sockets** (recommended for local development)
```python
# In your MCP server
from hud.server.context import attach_context
ctx = attach_context("/tmp/my_env.sock")
```
* **TCP/HTTP API**
```python
# Connect to HTTP API
async with httpx.AsyncClient() as client:
    response = await client.get("http://localhost:8080/state")
```
* **gRPC**
```python
# Connect via gRPC
channel = grpc.aio.insecure_channel('localhost:50051')
stub = MyServiceStub(channel)
```
* **Shared Memory/IPC**
```python
# Using multiprocessing managers
from multiprocessing.managers import BaseManager
manager.connect()
```
#### Example: Browser Environment Pattern
The browser environment demonstrates this pattern:
1. **Context Server** (`context_server.py`): Persistent process managing browser state
2. **MCP Server** (`server.py`): Lightweight server that restarts on code changes
3. **Connection**: Unix socket at `/tmp/hud_browser_ctx.sock`
```python
# context_server.py - Runs once, maintains state
class PersistentBrowserContext:
    def __init__(self):
        self.browser = None
        self.apps = {}

    def startup(self):
        # Initialize browser, X11, VNC once
        self.browser = launch_browser()

# server.py - Can restart anytime
@mcp.initialize
async def initialize_environment(ctx):
    # Connect to persistent context
    persistent_ctx = attach_context("/tmp/hud_browser_ctx.sock")
    # Use existing browser state
    browser = persistent_ctx.get_browser()
```
This separation allows `hud dev` to restart the MCP server for code changes while keeping your browser, database, or application running.
### Testing Your Environment
#### Traditional Testing
```bash
# Debug all phases
hud debug my-environment:latest
# Analyze available tools
hud analyze my-environment:latest
```
#### Hot-Reload Development
```bash
# Start development server with live reload
hud dev --build
# Edit src/ files - changes are immediately reflected
# Test in another terminal
hud analyze hud-my-environment:dev
```
The `hud dev` command:
* Mounts your `src/` directory into containers
* Auto-detects or builds Docker images
* Provides HTTP endpoint for Cursor integration
* Eliminates rebuild cycles during development
* **Restarts only the MCP server**, not your environment state
For stateful environments (browsers, databases, games), implement process separation so that `hud dev` hot-reload doesn't reset your environment state. See the browser environment for a complete example.
## SDK Reference
* Complete API documentation for MCPServer
* Available tools and executors
## Next Steps
For the complete step-by-step guide with all implementation details, see the [environments README](https://github.com/hud-evals/hud-python/blob/main/environments/README.md).
# TODO App Tutorial
Source: https://docs.hud.so/build-environments/quickstart-todo
Build your first MCP environment in 15 minutes
Let's build a TODO app environment from scratch. This hands-on tutorial will teach you the core concepts of MCP environment development.
## What We're Building
A simple TODO list that agents can interact with:
* Add, complete, and delete tasks
* List all tasks with their status
* Evaluate based on task completion
This tutorial assumes basic Python knowledge. Full code is available at `environments/todo-example`
## Step 1: Project Structure
Create a new directory for your environment:
```bash
mkdir todo-env
cd todo-env
```
Create the following structure:
```
todo-env/
├── Dockerfile
├── pyproject.toml
├── README.md
└── src/
└── hud_controller/
├── __init__.py
├── server.py
└── todo_app.py
```
## Step 2: The TODO Application
First, let's build the core TODO logic:
```python src/hud_controller/todo_app.py
from dataclasses import dataclass
from typing import List, Optional
from datetime import datetime

@dataclass
class TodoItem:
    id: int
    title: str
    completed: bool = False
    created_at: datetime = None

    def __post_init__(self):
        if self.created_at is None:
            self.created_at = datetime.now()

class TodoApp:
    def __init__(self):
        self.todos: List[TodoItem] = []
        self.next_id = 1

    def add_todo(self, title: str) -> TodoItem:
        """Add a new TODO item"""
        todo = TodoItem(id=self.next_id, title=title)
        self.todos.append(todo)
        self.next_id += 1
        return todo

    def complete_todo(self, todo_id: int) -> bool:
        """Mark a TODO as completed"""
        for todo in self.todos:
            if todo.id == todo_id:
                todo.completed = True
                return True
        return False

    def delete_todo(self, todo_id: int) -> bool:
        """Delete a TODO item"""
        for i, todo in enumerate(self.todos):
            if todo.id == todo_id:
                self.todos.pop(i)
                return True
        return False

    def list_todos(self) -> List[dict]:
        """Get all TODOs as dictionaries"""
        return [
            {
                "id": todo.id,
                "title": todo.title,
                "completed": todo.completed,
                "created_at": todo.created_at.isoformat()
            }
            for todo in self.todos
        ]

    def clear_all(self):
        """Reset the TODO list"""
        self.todos = []
        self.next_id = 1
```
## Step 3: MCP Server Integration
Now wrap the TODO app with MCP:
```python src/hud_controller/server.py
import sys
import logging

from hud.server import MCPServer
from .todo_app import TodoApp

# Configure logging to stderr
logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format='[%(levelname)s] %(asctime)s | %(name)s | %(message)s'
)

# Create server and app instances
mcp = MCPServer(name="TODO Environment")
app = TodoApp()

# Interaction tools (visible to agents)
@mcp.tool()
async def add_todo(title: str) -> dict:
    """Add a new TODO item to the list"""
    todo = app.add_todo(title)
    return {
        "success": True,
        "todo": {
            "id": todo.id,
            "title": todo.title
        },
        "message": f"Added TODO: {title}"
    }

@mcp.tool()
async def complete_todo(todo_id: int) -> dict:
    """Mark a TODO item as completed"""
    success = app.complete_todo(todo_id)
    return {
        "success": success,
        "message": f"{'Completed' if success else 'Failed to complete'} TODO {todo_id}"
    }

@mcp.tool()
async def list_todos() -> dict:
    """List all TODO items"""
    todos = app.list_todos()
    return {
        "todos": todos,
        "total": len(todos),
        "completed": sum(1 for t in todos if t["completed"])
    }

@mcp.tool()
async def delete_todo(todo_id: int) -> dict:
    """Delete a TODO item"""
    success = app.delete_todo(todo_id)
    return {
        "success": success,
        "message": f"{'Deleted' if success else 'Failed to delete'} TODO {todo_id}"
    }

# Setup tool (hidden from agents)
@mcp.tool()
async def setup(initial_todos: list[str] = None) -> dict:
    """Initialize TODO list with optional items"""
    app.clear_all()
    if initial_todos:
        for title in initial_todos:
            app.add_todo(title)
    return {
        "status": "success",
        "message": f"Initialized with {len(initial_todos or [])} TODOs"
    }

# Evaluation tool (hidden from agents)
@mcp.tool()
async def evaluate(target_completed: int = None) -> dict:
    """Evaluate based on TODO completion"""
    todos = app.list_todos()
    total = len(todos)
    completed = sum(1 for t in todos if t["completed"])

    if target_completed is not None:
        # Specific target
        reward = 1.0 if completed >= target_completed else completed / target_completed
    else:
        # General completion rate
        reward = completed / total if total > 0 else 0.0

    return {
        "reward": reward,
        "done": reward >= 1.0,
        "info": {
            "completed": completed,
            "total": total,
            "details": f"{completed}/{total} TODOs completed"
        }
    }

# Initialize handler
@mcp.initialize
async def on_initialize(ctx):
    """Called when MCP connection is established"""
    logging.info("TODO environment ready!")

if __name__ == "__main__":
    mcp.run()
```
## Step 4: Package Configuration
Create a `pyproject.toml` file:
```toml pyproject.toml
[project]
name = "todo-env"
version = "0.1.0"
description = "TODO app MCP environment"
requires-python = ">=3.11"
dependencies = [
    "hud-python>=0.5.0",
]
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"
```
## Step 5: Docker Configuration
Create the Dockerfile:
```dockerfile Dockerfile
FROM python:3.11-slim
# Prevent Python from buffering output
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1
WORKDIR /app
# Copy package files
COPY pyproject.toml ./
COPY src/ ./src/
# Install in editable mode for development
RUN pip install --no-cache-dir -e .
# Run as a module
CMD ["python", "-m", "hud_controller.server"]
```
## Step 6: Build and Test
### Option 1: Traditional Build & Test
Build your Docker image:
```bash
docker build -t todo-env .
```
Test with HUD CLI:
```bash
# Run the 5-phase debug check
hud debug todo-env
# If that passes, analyze the tools
hud analyze todo-env
```
### Option 2: Hot-Reload Development (Recommended)
Use the development proxy for instant feedback:
```bash
# Start development server with hot-reload
hud dev --build
# In another terminal, test your environment
hud analyze hud-todo-env:dev
```
Now you can edit `src/hud_controller/server.py` and see changes instantly without rebuilding!
Your environment should pass all 5 debug phases and show the available tools
## Step 7: Create Tasks
Now create tasks for agents to solve:
```python
from hud.datasets import Task

tasks = [
    Task(
        prompt="Add three TODOs: buy groceries, call mom, finish report",
        mcp_config={
            "local": {
                "command": "docker",
                "args": ["run", "--rm", "-i", "todo-env:latest"]
            }
        },
        setup_tool={"name": "setup"},
        evaluate_tool={
            "name": "evaluate",
            "arguments": {"target_completed": 3}
        }
    ),
    Task(
        prompt="Complete all TODOs that contain the word 'urgent'",
        mcp_config={
            "local": {
                "command": "docker",
                "args": ["run", "--rm", "-i", "todo-env:latest"]
            }
        },
        setup_tool={
            "name": "setup",
            "arguments": {
                "initial_todos": [
                    "urgent: fix bug",
                    "buy coffee",
                    "urgent: submit report"
                ]
            }
        },
        evaluate_tool={
            "name": "evaluate",
            "arguments": {"target_completed": 2}
        }
    )
]
```
## Step 8: Test with an Agent
Run your first evaluation:
```python
import asyncio
import hud
from hud.agents import ClaudeAgent

async def test_todo_env():
    task = tasks[0]  # First task

    with hud.trace("todo-test"):
        agent = ClaudeAgent()
        result = await agent.run(task)

        print(f"Success: {result.reward == 1.0}")
        print(f"Details: {result.info}")

asyncio.run(test_todo_env())
```
## Publishing Your Environment
Once tested, publish to Docker Hub:
```bash
# Tag your image
docker tag todo-env your-dockerhub/todo-env:v1.0
# Push to registry
docker push your-dockerhub/todo-env:v1.0
```
### Using Your Environment Remotely
Others can now use your environment at scale:
```python
import os
from hud.datasets import Task
from hud.agents import ClaudeAgent

# Remote MCP configuration
task = Task(
    prompt="Create 3 todos and complete the urgent ones",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                "Mcp-Image": "your-dockerhub/todo-env:v1.0"
            }
        }
    },
    setup_tool={"name": "setup"},
    evaluate_tool={"name": "evaluate", "arguments": {"target_completed": 2}}
)

# Run with any agent
agent = ClaudeAgent()
result = await agent.run(task)
```
## Next Steps
This tutorial covered the basics. For production environments, explore:
* Full 5-step process with advanced patterns
* Hub pattern, telemetry, and production tips
The official [environments README](https://github.com/hud-evals/hud-python/blob/main/environments/README.md) contains comprehensive implementation details, troubleshooting tips, and reference examples.
# Advanced Design
Source: https://docs.hud.so/build-environments/registry
Production patterns for MCP environments
Learn advanced patterns for building production-ready MCP environments. These techniques are used in environments like `remote_browser` and enable sophisticated agent interactions.
## The Hub Pattern
The hub pattern uses BaseHub, which is actually an inner MCP server that can be mounted to your main server. This provides namespace separation and modular organization.
### How Hubs Work
A hub is an MCP server that exposes a single tool. When you call this tool, you specify which function within the hub to execute:
```json
{
  "name": "setup",
  "arguments": {
    "name": "navigate_to_page",
    "arguments": {
      "url": "https://example.com"
    }
  }
}
```
The outer `name` is the hub tool, inner `name` is the specific function, and inner `arguments` are passed to that function.
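From the client side this is an ordinary tool call on the hub tool. A sketch, assuming an already-initialized `MCPClient` instance named `client` as shown in the Architecture docs:

```python
# Call the "setup" hub tool and route to its "navigate_to_page" function
result = await client.call_tool(
    "setup",
    {
        "name": "navigate_to_page",
        "arguments": {"url": "https://example.com"},
    },
)
```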
### Implementation
```python
# setup/__init__.py
from hud.tools.base import BaseHub
setup = BaseHub("setup")
# Import all setup modules
from . import navigation, data, auth
__all__ = ["setup"]
```
```python
# setup/navigation.py
from . import setup
@setup.tool()
async def navigate_to_page(url: str):
    """Navigate browser to specified URL."""
    env = setup.env  # Access shared environment
    await env.browser.goto(url)
    return {"status": "success", "url": url}

@setup.tool()
async def reset_browser():
    """Clear cookies and return to blank page."""
    env = setup.env
    await env.browser.clear_cookies()
    await env.browser.goto("about:blank")
```
```python
# server.py
from .setup import setup as setup_hub
from .evaluate import evaluate as evaluate_hub

mcp = MCPServer(name="my-environment")

@mcp.initialize
async def initialize_environment(ctx):
    # Create environment context
    env = await create_environment()

    # Set environment on hubs
    setup_hub.env = env
    evaluate_hub.env = env

    # Mount hubs to server
    mcp.mount(setup_hub)
    mcp.mount(evaluate_hub)
```
Benefits:
* **Modular organization**: Each function in its own file
* **Namespace separation**: Tools grouped logically
* **Shared state**: Access environment via `hub.env`
* **Auto-discovery**: Import registers all decorated functions
## Progress Notifications
For long-running initialization, provide real-time feedback:
```python
@mcp.initialize
async def initialize_environment(ctx):
    """Initialize with progress updates."""
    # Extract progress token from context
    progress_token = getattr(ctx.meta, "progressToken", None) if ctx.meta else None

    async def send_progress(progress: int, message: str):
        if progress_token:
            await ctx.session.send_progress_notification(
                progress_token=progress_token,
                progress=progress,
                total=100,
                message=message
            )

    await send_progress(10, "Starting services...")
    await start_x11_server()

    await send_progress(40, "Launching browser...")
    browser = await launch_browser()

    await send_progress(70, "Connecting to cloud provider...")
    await connect_provider()

    await send_progress(100, "Environment ready!")
```
This shows live in the HUD platform during container startup.
## Dynamic Configuration via Agent Metadata
Agents can pass metadata that gets injected into the MCP server's initialization context using `patch_mcp_config`:
```python
# In your agent
from hud.agents import ClaudeAgent

agent = ClaudeAgent(
    metadata={
        "display_width": 1280,
        "display_height": 720,
        "browser_provider": "anchorbrowser"
    }
)

# This metadata is patched into mcp_config via patch_mcp_config
# See hud/agents/base.py#L476-L479
```
### How It Works
The agent's `_setup_config` method uses `patch_mcp_config` to inject metadata:
```python
from hud.utils.mcp import MCPConfigPatch, patch_mcp_config

async def _setup_config(self, mcp_config: dict[str, dict[str, Any]]) -> None:
    """Inject metadata into the metadata of the initialize request."""
    if self.metadata:
        patch_mcp_config(
            mcp_config,
            MCPConfigPatch(meta=self.metadata),
        )
```
### Using Metadata in Your Environment
```python
@mcp.initialize
async def initialize_environment(ctx):
    # Access metadata from context
    metadata = ctx.meta

    # Adjust display resolution
    width = metadata.get("display_width", 1920)
    height = metadata.get("display_height", 1080)

    # Select browser provider
    provider = metadata.get("browser_provider", "chromium")

    # Configure performance
    headless = metadata.get("headless", False)

    # Apply configuration
    browser_config = {
        "viewport": {"width": width, "height": height},
        "headless": headless,
        "args": ["--disable-gpu"] if headless else []
    }
    browser = await launch_browser(browser_config)
```
See [patch\_mcp\_config implementation](https://github.com/hud-evals/hud-python/blob/main/hud/utils/mcp.py#L17-L35) for details.
## Live Telemetry
Expose real-time data via resources:
```python
@mcp.resource("telemetry://live")
async def get_telemetry():
    """Provide live environment status."""
    return {
        "status": "running",
        "live_url": get_live_view_url(),  # VNC or browser URL
        "metrics": {
            "memory_usage": get_memory_usage(),
            "active_connections": count_connections(),
            "uptime": get_uptime()
        },
        "logs": get_recent_logs(lines=50)
    }
```
This appears in the HUD platform UI for debugging.
## VNC Server Integration
For GUI environments, provide visual access:
```dockerfile
# Start from base image with VNC
FROM hudpython/novnc-base:latest

# Your GUI app installation
RUN apt-get update && apt-get install -y your-gui-app

# VNC is already configured on ports 5901 (VNC) and 8080 (noVNC)
```
In your MCP server:
```python
@mcp.resource("telemetry://live")
async def get_telemetry():
    # Provide noVNC URL for browser viewing
    return {
        "live_url": "http://localhost:8080/vnc.html",
        "vnc_port": 5901,
        "password": "vncpass"  # Set via VNC_PW env var
    }
```
## Using Base Images
HUD provides optimized base images:
### hudpython/novnc-base
* Ubuntu with XFCE desktop
* VNC and noVNC pre-configured
* Common utilities installed
```dockerfile
FROM hudpython/novnc-base:latest
# Add your GUI application
RUN apt-get update && apt-get install -y \
    libreoffice \
    firefox \
    your-app
# VNC auto-starts with your app
```
### hudpython/browser-base
* Chromium and Playwright installed
* Optimized for browser automation
```dockerfile
FROM hudpython/browser-base:latest
# Add your browser automation code
COPY src/ /app/src/
CMD ["python", "-m", "src.server"]
```
## Phase 6: Automated Improvement
Enable Cursor Agent to improve your environment:
### 1. Development Configuration
Use [reloaderoo](https://github.com/cameroncooke/reloaderoo) for hot-reload development:
```json
// .cursor/mcp.json
{
  "mcpServers": {
    "my-env-dev": {
      "command": "npx",
      "args": [
        "reloaderoo", "--",  // Enables hot-reload without restarting client
        "docker", "run", "-i", "--rm",
        "-v", "$(pwd)/src:/app/src:rw",
        "-e", "PYTHONDEBUG=1",
        "my-env:dev"
      ]
    }
  }
}
```
### 2. Cursor Rules
Create `.cursor/rules/mcp_environment_iteration.mdc`:
```markdown
---
description: Improve an MCP environment
alwaysApply: false
---
Setup
1. Ensure dev mode MCP config exists
2. Explore environment structure
3. Clarify improvement objectives
Iteration
1. Test current functionality via tools
2. Note errors and improvements in TODOs
3. Fix issues in source code
4. Call restart_server to reload
5. Verify fixes work correctly
6. Continue until all TODOs complete
```
### 3. Improvement Workflow
Prompt Cursor:
```
Context: I have a browser environment in my-env folder
Tools: Use my-env-dev to interact with it
Task: Test all evaluators and fix any that fail
```
The agent will:
1. Explore available tools
2. Test each function
3. Fix issues in your code
4. Reload and verify
5. Continue iterating
## Example: Production Browser Environment
Here's how `remote_browser` implements these patterns:
```text
# Modular structure
src/hud_controller/
├── server.py # Main MCP server with progress
├── providers/ # Cloud browser integrations
├── setup/ # Hub with navigation, cookies
├── evaluate/ # Hub with page checks
└── tools/ # Playwright and computer tools
# Advanced features
- Progress notifications during startup
- Live browser URL in telemetry
- Dynamic resolution from metadata
- Multiple provider support
- Automatic cleanup on shutdown
```
## Next Steps
* Complete guide with all implementation details
* Study production environments
# Architecture
Source: https://docs.hud.so/core-concepts/architecture
How HUD connects agents, environments, and evaluation
HUD is built as a thin orchestration layer on top of MCP, providing structure for agent evaluation, environment management, and telemetry.
## System Overview
```mermaid
graph TB
subgraph "Remote"
Dashboard["📊 Dashboard
(app.hud.so)"]
API["🔌 MCP Orchestrator
(mcp.hud.so)"]
end
subgraph "Your Code"
Agent["🤖 Agent
(Claude/Operator)"]
Task["📋 Task
(Prompt + Evaluation)"]
SDK["📦 hud-python"]
end
subgraph "Environments"
LocalEnv["🖥️ Local Docker
(Development)"]
RemoteEnv["☁️ Remote Docker
(100s Parallel)"]
end
subgraph "OpenTelemetry"
Trace["📡 Traces & Metrics"]
end
Dataset["📚 Dataset
(HuggingFace)"]
AnyMCP["🔗 Any MCP Client
(Cursor, Claude, Custom)"]
Agent <--> SDK
Task --> SDK
Dataset <-.-> Task
SDK <-->|"MCP"| LocalEnv
SDK <-->|"MCP"| API
API <-->|"MCP"| RemoteEnv
SDK -->|"hud.trace"| Trace
Trace --> Dashboard
AnyMCP -->|"MCP"| API
style Dashboard fill:#e0e7ff,stroke:#6366f1,stroke-width:2px
style SDK fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style RemoteEnv fill:#d1fae5,stroke:#10b981,stroke-width:2px
style AnyMCP fill:#fce7f3,stroke:#ec4899,stroke-width:2px,stroke-dasharray: 5 5
```
## Core Components
### 1. Agents (`hud.agents`)
Agents make decisions and call tools:
```python
from hud.agents import ClaudeAgent, OperatorAgent

# Built-in agents
agent = ClaudeAgent()    # Uses Anthropic's Claude
agent = OperatorAgent()  # Uses OpenAI Operator

# All agents inherit from MCPAgent (interface sketch)
class MCPAgent:
    async def initialize(self, task: str | Task | None): ...
    async def run(self, prompt_or_task: str | Task, max_steps: int = 10) -> Trace: ...
    async def call_tools(self, tool_calls: list[MCPToolCall]) -> list[MCPToolResult]: ...
```
Agents can auto-create MCP clients from `task.mcp_config` - no manual client setup needed
### 2. Tasks (`hud.Task`)
Tasks define what agents should accomplish:
```python
from hud.datasets import Task
task = Task(
    prompt="Complete 3 items in the TODO list",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": f"Bearer {api_key}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },
    # Tool names and arguments map directly to MCP server tools
    setup_tool={
        "name": "setup",  # Calls def setup(name, num_items) on the environment
        "arguments": {
            "name": "todo_seed",
            "num_items": 5
        }
    },
    evaluate_tool={
        "name": "evaluate",  # Calls def evaluate(name, expected_count) on the environment
        "arguments": {
            "name": "todo_completed",
            "expected_count": 3
        }
    }
)
```
The `name` and `arguments` in setup/evaluate tools correspond exactly to the tool names and parameters exposed by the MCP server
### 3. MCP Clients (`hud.clients`)
Clients handle the MCP protocol:
```python
from hud.clients import MCPClient
# Auto-created by agents, or manual:
client = MCPClient(mcp_config)
await client.initialize()
tools = await client.list_tools()
```
### 4. Environments
Environments are MCP servers exposing tools:
```python
from hud.server import MCPServer

server = MCPServer("my-env")

@server.tool()
def move(direction: str):
    # Environment logic
    return result
```
### 5. Telemetry (`hud.trace`)
Real-time observability:
```python
import hud

with hud.trace("my-evaluation"):
    result = await agent.run(task)

# View at app.hud.so/traces/{trace_id}
```
## Execution Flow
1. Create a `Task` with prompt and MCP configuration
2. Agent creates MCP client (if needed) and connects to environment
3. Execute `setup_tool` to initialize environment state
4. Agent receives observations, makes decisions, calls tools
5. Execute `evaluate_tool` to score performance
6. All interactions streamed to HUD backend for analysis
## Key Design Principles
1. **Protocol-First**: Everything speaks MCP
2. **Composable**: Mix and match agents, environments, evaluations
3. **Observable**: Built-in telemetry for every interaction
4. **Testable**: Reproducible evaluations with Docker
5. **Extensible**: Easy to add new agents or environments
The `MCPServer` class wraps FastMCP with lifecycle management, making it easy to build Docker-based environments
## Next Steps
* Test and benchmark agents on standardized tasks
* Create MCP-compatible environments for your software
# MCP Protocol
Source: https://docs.hud.so/core-concepts/mcp-protocol
Understanding the Model Context Protocol that connects agents to environments
The Model Context Protocol (MCP) is the backbone of HUD: it provides a standard way for AI agents to interact with any software environment through tool calls.
## Why MCP?
Traditional agent frameworks couple agents tightly to specific environments. MCP decouples them:
**Without MCP:**
* Agent code hardcoded for each environment
* No standardization across tools
* Difficult to swap agents or environments

**With MCP:**
* Any agent works with any environment
* Standard protocol for all interactions
* Easy to swap components
## How It Works
MCP standardizes agent-environment communication through JSON-RPC messages. Agents call tools exposed by environments and receive structured responses.
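For example, a tool call and its response are plain JSON-RPC messages. A simplified sketch of the wire format (the `move` tool is illustrative):

```json
// Request: the agent calls the "move" tool
{"jsonrpc": "2.0", "id": 1, "method": "tools/call",
 "params": {"name": "move", "arguments": {"direction": "up"}}}

// Response: the environment returns structured content
{"jsonrpc": "2.0", "id": 1,
 "result": {"content": [{"type": "text", "text": "Moved up."}]}}
```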
## Core Concepts
### Tools
[Tools](https://modelcontextprotocol.io/specification/server/tools) are functions exposed by the environment:
```json
{
  "name": "move",
  "description": "Move in a direction",
  "inputSchema": {
    "type": "object",
    "properties": {
      "direction": {
        "type": "string",
        "enum": ["up", "down", "left", "right"]
      }
    }
  }
}
```
### Tool Calls & Results
Agents [call tools](https://modelcontextprotocol.io/specification/server/tools#calling-tools) and receive results:
```python
# Agent makes a tool call
result = await client.call_tool("move", {"direction": "up"})

# Environment returns result
{
    "content": [{
        "type": "text",
        "text": "Moved up. New board state: ..."
    }]
}
```
### Lifecycle Management
MCP defines a [rigorous lifecycle](https://modelcontextprotocol.io/specification/basic/lifecycle) for connections:
1. **Initialization**: Client and server negotiate capabilities and protocol version with `client.initialize()`
2. **Operation**: Normal tool calling and message exchange
3. **Shutdown**: Clean termination of the connection
The protocol ensures both sides understand each other's capabilities before proceeding.
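A sketch of that lifecycle using the SDK's `MCPClient` (assuming an `mcp_config` like the ones below; the cleanup method name is an assumption, so check the SDK for the exact API):

```python
from hud.clients import MCPClient

client = MCPClient(mcp_config)

# 1. Initialization: negotiate capabilities and protocol version
await client.initialize()

# 2. Operation: normal tool calling and message exchange
tools = await client.list_tools()
result = await client.call_tool("move", {"direction": "up"})

# 3. Shutdown: clean termination of the connection
#    (method name is an assumption; consult the SDK reference)
await client.shutdown()
```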
## HUD's MCP Extensions
HUD adds conventions on top of MCP:
1. **Setup Tools**: Initialize environment state (`setup_board`, `navigate_to_url`)
2. **Evaluate Tools**: Score agent performance (`evaluate_max_tile`, `contains_text`)
3. **Lifecycle Management**: Clean initialization and shutdown with `client.initialize()` and proper cleanup
See the [Tools Reference](/reference/tools) for implementation details.
## Transport Options
HUD environments are designed for 100% reproducibility through Docker:
Run environments locally for development and debugging using [stdio transport](https://modelcontextprotocol.io/specification/basic/transports#stdio):
```python
mcp_config = {
    "hud-text-2048": {
        "command": "docker",
        "args": ["run", "--rm", "-i", "hudpython/hud-text-2048:v1.2"]
    }
}
```
* **Transport**: [stdio](https://modelcontextprotocol.io/specification/basic/transports#stdio) - JSON-RPC over stdin/stdout
* **Pros**: Full control, easy debugging, no network latency
* **Use case**: Development, testing, single-agent runs
The `-i` flag enables Docker's interactive mode, allowing stdio communication between the client and server process.
Scale to parallel runs with cloud infrastructure using [Streamable HTTP transport](https://modelcontextprotocol.io/specification/basic/transports#streamable-http):
```python
mcp_config = {
    "hud": {
        "url": "https://mcp.hud.so/v3/mcp",
        "headers": {
            "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
            "Mcp-Image": "hudpython/hud-text-2048:v1.2"
        }
    }
}
```
* **Transport**: [Streamable HTTP](https://modelcontextprotocol.io/specification/basic/transports#streamable-http) with SSE
* **Pros**: Infinite parallelization, no local setup, managed infrastructure
* **Use case**: Benchmarks, CI/CD, multi-agent experiments
HUD's orchestrator manages Docker containers in the cloud, exposing them via HTTP endpoints with Server-Sent Events for streaming.
Both approaches use the exact same Docker image, ensuring identical behavior whether running locally or remotely
## Next Steps
* See how HUD builds on MCP for agent evaluation
# Task System
Source: https://docs.hud.so/core-concepts/task-system
How tasks define agent objectives and evaluation criteria
Tasks define what agents should do and how to measure success. They combine a prompt, environment configuration, and optional setup/evaluation phases.
## Lifecycle
```mermaid
graph LR
Task["📋 Task
(Prompt + Config)"]
Setup["🔧 Setup Phase
(Initialize State)"]
Execute["🤖 Execute Phase
(Agent Works)"]
Evaluate["📊 Evaluate Phase
(Score Result)"]
Task --> Setup
Setup --> Execute
Execute --> Evaluate
style Task fill:#e0e7ff,stroke:#6366f1,stroke-width:2px
style Execute fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style Evaluate fill:#d1fae5,stroke:#10b981,stroke-width:2px
```
## Task Structure
```python
from hud.datasets import Task
import uuid

task = Task(
    # Required fields
    prompt="Navigate to the login page and sign in as testuser@example.com",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },

    # Optional fields
    id=str(uuid.uuid4()),  # Required for HuggingFace datasets
    system_prompt="You are an expert web automation agent. Always verify page loads before interacting with elements.",
    setup_tool={
        "name": "playwright",
        "arguments": {
            "action": "navigate",
            "url": "https://example.com"
        }
    },
    evaluate_tool={
        "name": "evaluate",
        "arguments": {
            "name": "url_contains",
            "substring": "/dashboard"
        }
    },
    metadata={"category": "authentication", "difficulty": "easy"}
)
```
## Field Reference
* **`prompt`** (required): The instruction given to the agent. Be specific and clear about success criteria.
* **`mcp_config`** (required): Environment connection configuration. Supports environment variables with `${VAR}` syntax.
* **`id`**: Unique identifier. Required when creating HuggingFace datasets. Use `str(uuid.uuid4())`.
* **`system_prompt`**: Custom system prompt for the agent. Overrides the agent's default system prompt.
* **`setup_tool`**: Tool(s) to initialize environment state, with `name` and `arguments`.
* **`evaluate_tool`**: Tool(s) to score performance, with `name` and `arguments`.
* **`metadata`**: Additional task information for analysis and filtering.
## Environment Variables
Tasks automatically resolve `${VAR}` patterns when deserialized from dictionaries. This is why HuggingFace datasets should store raw dictionaries, not Task objects.
```python
# In dataset JSON:
{
    "prompt": "Complete the TODO list",
    "mcp_config": {
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "${BROWSER_IMAGE}"
            }
        }
    }
}

# When loaded:
task = Task(**task_dict)  # Variables resolved here!
# Now task.mcp_config["hud"]["headers"]["Authorization"] = "Bearer sk-hud-..."
```
This enables:
* **Public datasets** without exposing secrets
* **Environment-specific** configurations
* **CI/CD pipelines** with different credentials
## Running Tasks
```python
# Agent automatically handles all phases
result = await agent.run(task)
print(f"Success: {result.reward}") # 0.0 to 1.0
```
The agent will:
1. Execute `setup_tool` if provided
2. Work on the `prompt` using available tools
3. Execute `evaluate_tool` to calculate reward
## Working with Datasets
Tasks integrate with HuggingFace datasets:
```python
from datasets import load_dataset
from hud.datasets import run_dataset

# Load dataset from HuggingFace
dataset = load_dataset("hud-evals/sheetbench-50", split="train")

# Run agent on entire dataset
results = await run_dataset(
    "SheetBench Run",
    dataset,  # Pass the dataset object or name (e.g. "hud-evals/sheetbench-50")
    agent_class=ClaudeAgent,
    max_concurrent=5
)
```
## Creating Datasets
```python
from hud.datasets import save_tasks

# Create task dictionaries (NOT Task objects!)
task_dicts = [
    {
        "prompt": "Navigate to the login page",
        "mcp_config": {
            "hud": {
                "url": "${MCP_URL}",
                "headers": {"Authorization": "Bearer ${HUD_API_KEY}"}
            }
        },
        "setup_tool": {"name": "playwright", "arguments": {"action": "navigate", "url": "https://example.com"}},
        "evaluate_tool": {"name": "url_match", "arguments": {"pattern": ".*/login"}}
    },
    # More task dicts...
]

# Save to HuggingFace (preserves ${VAR} templates)
save_tasks(task_dicts, "my-org/my-benchmark")
```
Always save dictionaries, not Task objects. Task objects have already resolved environment variables!
## Best Practices
1. **Use UUIDs**: Always include `id=str(uuid.uuid4())` for dataset tasks
2. **Clear Prompts**: Be specific about success criteria
3. **Template Variables**: Use `${VAR}` syntax for shareable configs
4. **Rich Tools**: Include both `name` and `arguments` in tool definitions
## Next Steps
* Use tasks with standard benchmarks
* Create your own evaluation scenarios
# Create Agents
Source: https://docs.hud.so/evaluate-agents/create-agents
Build your own MCP-compatible agent for HUD evaluation
Create custom agents that work with HUD's evaluation system. You can either extend the `MCPAgent` class for full control, or use the quickstart template for a simpler approach.
## Quick Start with Template
The fastest way to create an agent is using the HUD quickstart template:
```bash
# Clone the quickstart repository
uvx hud-python quickstart
# This creates a full agent implementation with:
# - Complete lifecycle example
# - Direct tool call handling
# - Ready-to-run code
```
The quickstart provides a working agent that you can modify to add your own logic for making tool calls.
## Building from Scratch
For full control, create an agent by inheriting from `MCPAgent` and implementing four core methods:
```python
from hud.agents import MCPAgent
from hud.types import AgentResponse, MCPToolCall, MCPToolResult

class MyAgent(MCPAgent):
    """Your custom agent implementation."""

    async def get_system_messages(self) -> list[Any]:
        """Return system messages for your LLM."""
        pass

    async def get_response(self, messages: list[Any]) -> AgentResponse:
        """Generate agent response with tool calls."""
        pass

    async def format_blocks(self, blocks: list[ContentBlock]) -> list[Any]:
        """Format content blocks for your LLM."""
        pass

    async def format_tool_results(
        self, tool_calls: list[MCPToolCall],
        tool_results: list[MCPToolResult]
    ) -> list[Any]:
        """Format tool results back into messages."""
        pass
```
## Understanding the Interface
The methods provide a bridge between two worlds:
* **`messages: list[Any]`** - Your LLM's native message format (e.g., OpenAI's format)
* **`blocks: list[ContentBlock]`** - MCP's content blocks (text, images, etc.)
* **`tool_calls: list[MCPToolCall]`** and **`tool_results: list[MCPToolResult]`** - Native MCP types
These methods translate between your LLM's format and MCP's standardized format.
## Implementation Guide
### 1. Get System Messages
Define the system prompt for your agent:
```python
async def get_system_messages(self) -> list[Any]:
    """
    Return system messages in your LLM's format.
    """
    # Use the configured system prompt or default
    prompt = self.system_prompt or "You are a helpful assistant."

    # Return in your LLM's expected format
    return [{"role": "system", "content": prompt}]
```
### 2. Get Response Method
This is where your agent decides what to do:
```python
async def get_response(self, messages: list[Any]) -> AgentResponse:
    """
    Call your LLM and return tool calls.

    Args:
        messages: Conversation history in your LLM's format

    Returns:
        AgentResponse with content and tool_calls
    """
    # Call your LLM API
    response = await self.llm_client.chat(
        messages=messages,
        tools=self._available_tools  # Provided by MCPAgent
    )

    # Parse response into tool calls
    tool_calls = []
    if response.tool_calls:
        for tc in response.tool_calls:
            tool_calls.append(
                MCPToolCall(
                    name=tc.function.name,
                    arguments=json.loads(tc.function.arguments),
                    id=tc.id  # Optional tool call ID
                )
            )

    return AgentResponse(
        content=response.content,
        tool_calls=tool_calls
    )
```
### 3. Format Blocks Method
Convert MCP's content blocks into your LLM's message format:
```python
async def format_blocks(self, blocks: list[ContentBlock]) -> list[Any]:
    """
    Format content blocks from tools into messages.

    Args:
        blocks: List of MCP ContentBlock objects

    Returns:
        Messages in your LLM's expected format
    """
    # Example for OpenAI-style format
    content = []
    for block in blocks:
        if block.type == "text":
            content.append({"type": "text", "text": block.text})
        elif block.type == "image":
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:{block.mime_type};base64,{block.data}"}
            })

    return [{"role": "user", "content": content}]
```
### 4. Format Tool Results Method
Convert tool execution results back into messages:
```python
async def format_tool_results(
    self,
    tool_calls: list[MCPToolCall],
    tool_results: list[MCPToolResult]
) -> list[Any]:
    """
    Format tool results for the next LLM call.

    Args:
        tool_calls: The MCP tool calls that were made
        tool_results: MCP results from executing those tools

    Returns:
        Messages to append to conversation
    """
    messages = []

    for call, result in zip(tool_calls, tool_results):
        # Add the assistant's tool call
        messages.append({
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": call.id or str(uuid.uuid4()),
                "type": "function",
                "function": {
                    "name": call.name,
                    "arguments": json.dumps(call.arguments)
                }
            }]
        })

        # Add the tool result
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result.content)
        })

    return messages
```
## Custom Agent Implementation
To create a custom agent, implement the `MCPAgent` protocol by defining these four core methods:
```python
from hud.agents import MCPAgent
from hud.types import AgentResponse, MCPToolCall, MCPToolResult
from typing import Any
import json

class CustomAgent(MCPAgent):
    """Custom agent implementation for any LLM provider."""

    def __init__(self, model: str = "your-model", **kwargs):
        super().__init__(**kwargs)
        self.model = model
        # Initialize your LLM client here
        # self.client = YourLLMClient()

    async def get_system_messages(self) -> list[Any]:
        """Return system messages for your LLM.

        This method should return the system prompt and any other
        initialization messages in your LLM's expected format.
        """
        prompt = self.system_prompt or "You are a helpful assistant that uses tools to complete tasks."
        return [{"role": "system", "content": prompt}]

    async def get_response(self, messages: list[Any]) -> AgentResponse:
        """Generate agent response with tool calls.

        This is where you call your LLM and parse the response into
        tool calls. Convert available MCP tools to your LLM's format.
        """
        # Convert MCP tools to your LLM's tool format
        tools = [
            {
                "name": tool.name,
                "description": tool.description,
                "parameters": tool.inputSchema
            }
            for tool in self._available_tools
        ]

        # Call your LLM with tools
        response = await self.call_llm(messages, tools)

        # Parse tool calls from response
        tool_calls = self.parse_tool_calls(response)

        return AgentResponse(
            content=response.get("content", ""),
            tool_calls=tool_calls
        )

    async def format_blocks(self, blocks: list[Any]) -> list[Any]:
        """Format content blocks for your LLM.

        Convert MCP content blocks (text, images) into your LLM's
        message format.
        """
        messages = []
        for block in blocks:
            if block.type == "text":
                messages.append({"role": "user", "content": block.text})
            elif block.type == "image":
                # Handle image blocks if your LLM supports them
                messages.append({
                    "role": "user",
                    "content": [
                        {"type": "image", "image": block.data}
                    ]
                })
        return messages

    async def format_tool_results(
        self,
        tool_calls: list[MCPToolCall],
        tool_results: list[MCPToolResult]
    ) -> list[Any]:
        """Format tool results for your LLM.

        Convert tool call results back into your LLM's message format.
        """
        messages = []

        # Add assistant message with tool calls
        messages.append({
            "role": "assistant",
            "tool_calls": [
                {
                    "id": call.id,
                    "name": call.name,
                    "arguments": call.arguments
                }
                for call in tool_calls
            ]
        })

        # Add tool results
        for call, result in zip(tool_calls, tool_results):
            content = json.dumps(result.content) if result.content else ""
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": content
            })

        return messages

    # Helper methods you'll need to implement
    async def call_llm(self, messages: list[Any], tools: list[Any]) -> dict:
        """Call your LLM API and return the response."""
        # Implement your LLM API call here
        pass

    def parse_tool_calls(self, response: dict) -> list[MCPToolCall]:
        """Parse tool calls from your LLM's response format."""
        # Implement parsing logic for your LLM's tool call format
        pass
```
## Testing Your Agent
Test your agent on a simple task:
```python
import asyncio
import hud
import os
from hud.datasets import Task

async def test_agent():
    with hud.trace("test-custom-agent"):
        task = Task(
            prompt="Navigate to example.com",
            mcp_config={
                "hud": {
                    "url": "https://mcp.hud.so/v3/mcp",
                    "headers": {
                        "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                        "Mcp-Image": "hudpython/hud-remote-browser:latest"
                    }
                }
            },
            setup_tool={
                "name": "setup",
                "arguments": {
                    "name": "navigate",
                    "arguments": {"url": "https://example.com"}
                }
            },
            evaluate_tool={
                "name": "evaluate",
                "arguments": {
                    "name": "url_match",
                    "arguments": {"pattern": "example.com"}
                }
            }
        )

        # Use your custom agent
        agent = CustomAgent(model="your-model")
        result = await agent.run(task)

        print(f"Reward: {result.reward}")

asyncio.run(test_agent())
```
## Built-in Agents
For quick prototyping, HUD provides built-in agents:
```python
from hud.agents import ClaudeAgent, OperatorAgent

# Claude (Anthropic)
claude_agent = ClaudeAgent(
    allowed_tools=["anthropic_computer"]
)

# OpenAI
openai_agent = OperatorAgent(
    allowed_tools=["openai_computer"]
)
```
These are great for getting started, but implementing your own agent gives you full control over the conversation flow, tool selection, and response formatting.
## Advanced Features
### Custom System Prompts
```python
class MyAgent(MCPAgent):
    def __init__(self, **kwargs):
        super().__init__(
            system_prompt="You are an expert web automation agent.",
            **kwargs
        )
```
### Tool Filtering
```python
# Only allow specific tools
agent = MyAgent(
    allowed_tools=["click", "type", "playwright"],
    disallowed_tools=["execute_script"]
)
```
### Response Agent
The ResponseAgent is an OpenAI-powered helper that determines whether an agent should stop or continue based on the agent's messages. It's useful for handling ambiguous situations:
```python
from hud.agents.misc import ResponseAgent

# ResponseAgent analyzes agent messages like:
# "I've completed the form. Should I submit it?"
# And returns "STOP" or "CONTINUE"

agent = MyAgent(
    response_agent=ResponseAgent()  # Requires OPENAI_API_KEY
)
```
## Next Steps
* Track and compare agent performance
* Build custom evaluation datasets
# Create Benchmarks
Source: https://docs.hud.so/evaluate-agents/create-benchmarks
Build evaluation datasets using existing environments or custom setups
Create benchmarks to evaluate agent capabilities systematically. HUD makes it easy to build evaluation datasets using existing environments or by creating entirely custom setups.
## Using Existing Environments
The fastest way to create benchmarks is to leverage existing HUD environments and their evaluators.
### 1. Explore Available Evaluators
Use `hud analyze` to discover what's available:
```bash
$ hud analyze hudpython/hud-remote-browser:latest
📋 Evaluators:
├── url_match(pattern: str) - Check if URL matches pattern
├── page_contains(text: str, regex: bool = False) - Check page content
├── element_exists(selector: str) - Check if CSS selector exists
├── cookie_exists(name: str) - Check if cookie exists
├── sheet_contains(cell: str, value: str) - Check spreadsheet cell
├── sheets_cell_values(args: dict) - Verify multiple cell values
└── ... more evaluators
📝 Setup Functions:
├── navigate(url: str) - Navigate to URL
├── cookies(action: str, ...) - Manage cookies
├── sheets(action: str, ...) - Spreadsheet operations
└── ... more setup functions
```
### 2. Create Tasks with Known Evaluators
Build tasks using the discovered evaluators:
```python
import uuid
from hud.datasets import Task
# Example: Web navigation benchmark
web_tasks = []
# Task 1: Simple navigation
web_tasks.append({
"id": str(uuid.uuid4()),
"prompt": "Navigate to the documentation page",
"mcp_config": {
"hud": {
"url": "https://mcp.hud.so/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-remote-browser:latest"
}
}
},
"setup_tool": {
"name": "setup",
"arguments": {
"name": "navigate",
"arguments": {"url": "https://example.com"}
}
},
"evaluate_tool": {
"name": "evaluate",
"arguments": {
"name": "url_match",
"arguments": {
"pattern": ".*/docs.*"
}
}
},
"metadata": {"difficulty": "easy", "category": "navigation"}
})
# Task 2: Form interaction
web_tasks.append({
"id": str(uuid.uuid4()),
"prompt": "Submit the contact form with the name 'Test User' and email 'test@example.com'",
"setup_tool": {
"name": "setup",
"arguments": {
"name": "navigate",
"arguments": {"url": "https://example.com/contact"}
}
},
"evaluate_tool": {
"name": "evaluate",
"arguments": {
"name": "page_contains",
"arguments": {
"text": "Thank you for your submission"
}
}
},
"metadata": {"difficulty": "medium", "category": "forms"}
})
```
### 3. Test Your Benchmark
Test your benchmark with `run_dataset`:
```python
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent
# Test your tasks
results = await run_dataset(
"Web Navigation Test",
web_tasks, # Your list of task dicts
agent_class=ClaudeAgent,
max_concurrent=2
)
# Check results
success_rate = sum(r.reward > 0.5 for r in results) / len(results)
print(f"Success rate: {success_rate:.2%}")
```
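If some tasks fail, it helps to line each result up with the task that produced it. A minimal sketch, assuming `run_dataset` returns one result per task in the same order as the input list:
```python
# Inspect failing tasks (assumes results align with the input task order)
for task_dict, result in zip(web_tasks, results):
    if result.reward <= 0.5:
        print(f"FAILED [{task_dict['metadata']['category']}]: {task_dict['prompt']}")
        print(f"  info: {result.info}")
```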
## Publishing Benchmarks
All benchmarks are stored as HuggingFace datasets.
### Save Your Dataset
```python
from hud.datasets import save_tasks
# Save to HuggingFace (requires HF token)
save_tasks(
web_tasks, # List of task dictionaries
repo_id="my-org/web-navigation-benchmark",
private=False, # Make it public
tags=["web", "navigation", "automation"]
)
```
### Manual Upload
You can also create datasets manually:
1. Save tasks as JSON:
```python
import json

with open("benchmark.json", "w") as f:
    json.dump(web_tasks, f, indent=2)
```
2. Upload to HuggingFace:
* Go to [https://huggingface.co/new-dataset](https://huggingface.co/new-dataset)
* Upload your JSON file
* Add a README with benchmark description
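If you prefer to script the upload instead of using the web UI, the `huggingface_hub` client can create the dataset repo and push the JSON file. A minimal sketch (assumes you are logged in via `huggingface-cli login` or have `HF_TOKEN` set):
```python
from huggingface_hub import HfApi

api = HfApi()
repo_id = "my-org/web-navigation-benchmark"

# Create the dataset repo if it doesn't exist yet
api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)

# Upload the task file saved above
api.upload_file(
    path_or_fileobj="benchmark.json",
    path_in_repo="benchmark.json",
    repo_id=repo_id,
    repo_type="dataset",
)
```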
### Leaderboard Creation
Leaderboards are automatically created when the first scorecard is published for your dataset. Simply run an evaluation and create a scorecard at `app.hud.so/leaderboards/{your-dataset-id}`.
## Extending Existing Environments
### Contribute New Evaluators
For deeper customization, explore existing evaluators and contribute new ones:
```python
# Example: Contributing a new evaluator
# In environments/remote_browser/src/hud_controller/evaluate/form_complete.py
from hud.tools.types import EvaluationResult
from . import evaluate

@evaluate.tool("form_complete")
async def form_complete(ctx, required_fields: list[str]):
    """Check if all required form fields are filled."""
    # Access environment from the hub
    env = evaluate.env  # BrowserEnvironmentContext or similar
    # Use environment methods
    page_content = await env.get_page_content()
    # Check each field (simplified example)
    filled_count = 0
    for field in required_fields:
        if f'value="{field}"' in page_content or f'>{field}<' in page_content:
            filled_count += 1
    success = filled_count == len(required_fields)
    return EvaluationResult(
        reward=1.0 if success else filled_count / len(required_fields),
        done=success,
        info={"filled": filled_count, "required": len(required_fields)}
    )
```
Submit evaluators via pull requests to expand environment capabilities.
## Creating Custom Environments
For complete control over evaluation logic, build your own environment.
Complete guide to creating Docker-based MCP environments with custom evaluators
## Example: SheetBench-50
Here's an actual task from SheetBench-50:
```python
# Example task from the dataset
sheetbench_task = {
"id": "6e4744c7-b2c9-4bb6-807e-2cc144a4e8c2",
"prompt": "Calculate from the RawData tab the z-scores from the mean close price for each row. Return, starting in ANSWER!A1 and descending to ANSWER!A5, the 5 dates with the greatest absolute value of standard deviations from the mean",
"mcp_config": {
"hud": {
"url": "https://mcp.hud.so/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Run-Id": "${RUN_ID}",
"Mcp-Image": "hudpython/hud-remote-browser:v1.1"
}
}
},
"setup_tool": {
"name": "setup",
"arguments": {
"name": "sheets_from_xlsx",
"arguments": {
"file_url": "https://gahludmjcsmszgyufydt.supabase.co//storage/v1/object/public/sheetbench/c6ddeb9a-0c16-4f5e-8a06-f148ebb4be8a/setup_input_2.xlsx?"
}
}
},
"evaluate_tool": {
"name": "evaluate",
"arguments": {
"name": "sheets_cell_values",
"arguments": {
"args": {
"A1": "1/12/2024",
"A2": "1/10/2024",
"A3": "1/15/2024",
"A4": "1/11/2024",
"A5": "1/17/2024"
}
}
}
},
"metadata": {
"partial": True,
"gold_file_url": "https://gahludmjcsmszgyufydt.supabase.co//storage/v1/object/public/sheetbench/c6ddeb9a-0c16-4f5e-8a06-f148ebb4be8a/gold_solution_2.xlsx?"
}
}
# The dataset includes 50 such tasks covering:
# - Formula creation (VLOOKUP, SUMIF, etc.)
# - Data analysis (z-scores, correlations)
# - Data manipulation (sorting, filtering)
# - Chart creation
# - Pivot tables
```
## Best Practices
1. **Clear Instructions**: Write unambiguous prompts
2. **Isolated State**: Each task should be independent
3. **Measurable Success**: Use binary or graduated scoring
4. **Diverse Coverage**: Test different aspects of capability
5. **Difficulty Progression**: Include easy to hard tasks
6. **Metadata Tags**: Help users filter relevant tasks (see the sketch below)
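For the metadata point, a minimal sketch of slicing a task list by its tags before running it (plain dictionary filtering, no HUD-specific API assumed):
```python
# Filter tasks by metadata before passing them to run_dataset
easy_tasks = [t for t in web_tasks if t["metadata"]["difficulty"] == "easy"]
navigation_tasks = [t for t in web_tasks if t["metadata"]["category"] == "navigation"]
print(f"{len(easy_tasks)} easy tasks, {len(navigation_tasks)} navigation tasks")
```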
## Next Steps
Track and compare agent performance on your benchmarks
Create custom MCP environments for unique evaluation scenarios
Use your datasets to train better agents with RL
# Evaluate Agents
Source: https://docs.hud.so/evaluate-agents/index
Test and benchmark your agents on standardized tasks
HUD makes it easy to evaluate any MCP-compatible agent on a variety of tasks and benchmarks. Whether you're testing Claude, Operator, or your custom agent, the evaluation flow is consistent.
## Quick Start
Evaluate an agent on a single task:
```python
import hud
import os
from hud.datasets import Task
from hud.agents import ClaudeAgent

with hud.trace("eval-demo"):
    task = Task(
        prompt="Create a spreadsheet with quarterly sales data",
        mcp_config={
            "hud": {
                "url": "https://mcp.hud.so/v3/mcp",
                "headers": {
                    "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                    "Mcp-Image": "hudpython/hud-browser:latest"
                }
            }
        },
        setup_tool={
            "name": "playwright",
            "arguments": {
                "action": "navigate",
                "url": "https://sheets.google.com"
            }
        },
        evaluate_tool={
            "name": "evaluate",
            "arguments": {
                "name": "sheet_contains",
                "text": "Q1 Sales"
            }
        }
    )
    agent = ClaudeAgent()
    result = await agent.run(task)
    print(f"Success: {result.reward > 0.5}")
```
## What You Can Do
### 1. Create Tasks for Browser Automation
```python
import os
task = Task(
prompt="Navigate to GitHub and star the HUD repository",
mcp_config={
"hud": {
"url": "https://mcp.hud.so/v3/mcp",
"headers": {
"Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
"Mcp-Image": "hudpython/hud-browser:latest"
}
}
},
setup_tool={
"name": "setup",
"arguments": {
"name": "navigate",
"url": "https://github.com"
}
},
evaluate_tool={
"name": "evaluate",
"arguments": {
"name": "url_contains",
"substring": "hud-python"
}
}
)
```
### 2. Run Existing Benchmarks
```python
from datasets import load_dataset
from hud.datasets import run_dataset
# Load and run SheetBench-50
dataset = load_dataset("hud-evals/sheetbench-50", split="train")
results = await run_dataset(
"My SheetBench Run",
dataset,
agent_class=ClaudeAgent,
max_concurrent=5
)
```
### 3. Make Your Agent Work with HUD
To create a custom agent, inherit from `MCPAgent` and implement the required methods:
```python
from typing import Any

from hud.agents import MCPAgent
from hud.types import AgentResponse, MCPToolCall

class MyCustomAgent(MCPAgent):
    async def get_response(self, messages: list[Any]) -> AgentResponse:
        # Call your LLM and return tool calls
        ...

    async def format_blocks(self, blocks: list[Any]) -> list[Any]:
        # Format content blocks into messages for your LLM
        ...

    async def format_tool_results(
        self, tool_calls: list[MCPToolCall],
        tool_results: list[Any]
    ) -> list[Any]:
        # Format tool results back into messages
        ...

# Now it works with any HUD dataset!
```
See complete examples and implementation details
## Available Benchmarks
* **SheetBench-50** - 50 real-world spreadsheet tasks testing data manipulation, formulas, and analysis
* **OSWorld-Verified** - Desktop automation tasks across Ubuntu applications
* **2048** - Strategic planning in the 2048 game environment
* Web navigation and interaction challenges (Coming Soon)
## Exploring Environments
Use `hud analyze` to discover available tools and evaluators:
```bash
$ hud analyze hudpython/hud-browser:latest
🔍 Analyzing hudpython/hud-browser:latest...
📊 Environment Summary:
├── Tools: 15 available
├── Setup Functions: 8 available
├── Evaluators: 12 available
└── Resources: 3 available
🛠️ Tools:
├── playwright(action: str, **kwargs) - Browser automation actions
├── click(selector: str) - Click element
├── type(selector: str, text: str) - Type text
└── ... 12 more tools
📋 Evaluators:
├── url_contains(substring: str) - Check if URL contains text
├── page_contains(text: str, regex: bool = False) - Check page content
├── element_exists(selector: str) - Check if CSS selector exists
├── todo_completed(expected_count: int) - Verify TODO completion
└── ... 8 more evaluators
Run with --json for full details or pipe to grep for filtering.
```
## Publishing to Leaderboards
After running evaluations, view results on the leaderboard:
```python
# Run evaluation
results = await run_dataset(
"Claude-3.5 SheetBench",
dataset="hud-evals/sheetbench-50",
agent_class=ClaudeAgent
)
# Then visit: app.hud.so/leaderboards/hud-evals/sheetbench-50
# Click "My Jobs" to see your runs and create scorecards
```
## Key Features
* **Reproducible**: Docker environments ensure consistency
* **Parallel**: Run multiple evaluations concurrently
* **Observable**: Every tool call tracked with telemetry
* **Extensible**: Easy to add new tasks or benchmarks
## Next Steps
Build your own MCP-compatible agent
Track and compare agent performance
Build custom evaluation datasets
# Leaderboards
Source: https://docs.hud.so/evaluate-agents/leaderboards
Track and compare agent performance on standardized benchmarks
HUD automatically tracks all benchmark runs and provides leaderboards for comparing agent performance across teams and models.
## Running Benchmarks
Execute evaluations on standard datasets:
```python
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent
# Run a benchmark
results = await run_dataset(
name="Claude-3.5 SheetBench",
dataset="hud-evals/sheetbench-50",
agent_class=ClaudeAgent,
agent_config={"model": "claude-3-5-sonnet-20241022"},
max_concurrent=5
)
```
## Viewing Results
### On the Leaderboard Page
After running, view your results:
1. Navigate to `app.hud.so/leaderboards/{dataset-name}`
* Example: `app.hud.so/leaderboards/hud-evals/sheetbench-50`
2. Click "My Jobs" to see your evaluation runs
3. Select runs to analyze performance
### Local Analysis
Calculate metrics locally:
```python
# Basic metrics
success_rate = sum(r.reward > 0.5 for r in results) / len(results)
avg_reward = sum(r.reward for r in results) / len(results)
avg_duration = sum(r.duration for r in results) / len(results)
print(f"Success Rate: {success_rate:.2%}")
print(f"Average Reward: {avg_reward:.3f}")
print(f"Average Duration: {avg_duration:.1f}s")
```
## Creating Scorecards
Transform your evaluation run into a public scorecard:
1. Execute your agent on the full dataset
2. Go to `app.hud.so/leaderboards/{dataset-name}`
3. Click "My Jobs" and choose your evaluation run
4. Select your best runs (we compute statistics over multiple jobs) and create a scorecard
5. Your scorecard appears on the public leaderboard
## Available Benchmarks
* **SheetBench-50** - Spreadsheet manipulation and analysis tasks
* **OSWorld-Verified** - Desktop automation in Ubuntu environment
* **2048** - Strategic planning and game playing
* Web navigation challenges (Coming Soon)
## Best Practices
1. **Agent Naming**: Ensure your agent class has a meaningful name - it will be displayed on the leaderboard
2. **Multiple Runs**: Select your best few runs when creating a scorecard (we compute statistics over them)
3. **Consistent Configuration**: Use the same model version and parameters for fair comparison
4. **Share Results**: Make your scorecards public to contribute to the community
Leaderboards are automatically created when the first scorecard is published for a dataset
## Next Steps
Build your own MCP-compatible agent
Build custom evaluation datasets
# Introduction
Source: https://docs.hud.so/index
OSS RL environment + evals toolkit.
Test Claude, Operator, or custom agents on benchmarks like SheetBench and OSWorld
Wrap any software in dockerized MCP for scalable and generalizable agent evaluation
Use reinforcement learning and GRPO on evaluations to improve agent performance
## What is HUD?
HUD connects AI agents to software environments using the Model Context Protocol (MCP). Whether you're evaluating existing agents, building new environments, or training models with RL, HUD provides the infrastructure.
```mermaid
graph LR
    Agent["🤖 Any Agent (Claude, Operator, etc.)"]
    MCP["🔌 MCP Protocol (Tool Calls)"]
    Env["📦 Any Environment (Browser, OS, etc.)"]
    Agent -->|"call_tool()"| MCP
    MCP -->|"click(x, y)"| Env
    Env -->|"screenshot"| MCP
    MCP -->|"get_response()"| Agent
    style Agent fill:#e0e7ff,stroke:#6366f1,stroke-width:2px
    style MCP fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
    style Env fill:#d1fae5,stroke:#10b981,stroke-width:2px
```
## Why HUD?
* **🔌 MCP-native**: Any agent can connect to any environment
* **📡 Live telemetry**: Debug every tool call at [app.hud.so](https://app.hud.so)
* **🚀 Production-ready**: From local Docker to cloud scale
* **🎯 Built-in benchmarks**: OSWorld-Verified, SheetBench-50, and more
* **🔧 CLI tools**: Debug, analyze, create environments with `hud debug` and `hud analyze`
Run your first agent evaluation with zero setup
```bash
uvx hud-python quickstart
```
## Quick Example
```python
import asyncio, os, hud
from hud.datasets import Task
from hud.agents import ClaudeAgent

async def main():
    # Define evaluation task with remote MCP
    task = Task(
        prompt="Win a game of 2048 by reaching the 128 tile",
        mcp_config={
            "hud": {
                "url": "https://mcp.hud.so/v3/mcp",
                "headers": {
                    "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                    "Mcp-Image": "hudpython/hud-text-2048:v1.2"
                }
            }
        },
        setup_tool={"name": "setup", "arguments": {"name": "board", "arguments": {"board_size": 4}}},
        evaluate_tool={"name": "evaluate", "arguments": {"name": "max_number", "arguments": {"target": 64}}}
    )
    # Run agent (auto-creates MCP client)
    agent = ClaudeAgent()
    result = await agent.run(task)
    print(f"Score: {result.reward}")

asyncio.run(main())
```
## Community
Star the repo and contribute
Join our community
### Are you a startup building agents?
[📅 Hop on a call](https://cal.com/jay-ram-z6st6w/demo) or [📧 founders@hud.so](mailto:founders@hud.so)
# LLM Quickstart
Source: https://docs.hud.so/llm-quickstart
Add context about hud to any coding agent
## Setup
```bash
claude mcp add --transport http docs-hud https://docs.hud.so/mcp
```
Add to MCP settings:
```json
"docs-hud": {
"url": "https://docs.hud.so/mcp"
}
```
Or use one-click install:
[Install in Cursor](https://cursor.com/en/install-mcp?name=docs-hud-so\&config=eyJ1cmwiOiJodHRwczovL2RvY3MuaHVkLnNvL21jcCJ9)
Alternatively, copy-paste the full `llms-full.txt` into your assistant's context.
## What you get
Your AI assistant gains access to:
* Complete HUD API reference and methods
* Working code examples and implementations
* Architecture patterns and best practices
* Real-time updates as documentation evolves
Try asking your assistant: "How do I create a custom agent in HUD?" or "Help me debug MCP tool calls"
## Next steps
Get started with HUD in 3 minutes
Create custom MCP environments
# Quickstart
Source: https://docs.hud.so/quickstart
Run your first agent evaluation in 3 minutes
Get up and running with HUD in minutes. This guide walks you through installation, basic setup, and running your first agent evaluation.
## Quick Clone
The fastest way to get started:
```bash
# Clone a complete example project with uv
uvx hud-python quickstart
```
This sets you up with a working agent evaluation example you can run immediately.
## Installation
```bash
pip install hud-python
```
```bash
# Includes AI providers and telemetry
pip install "hud-python[agent]"
```
```bash
# Install CLI in isolated environment
uv tool install hud-python
```
## API Keys
Set your API keys as environment variables:
```bash
export HUD_API_KEY="sk-hud-..." # Get from app.hud.so
export ANTHROPIC_API_KEY="sk-ant-..." # For Claude agents
export OPENAI_API_KEY="sk-..." # For OpenAI agents
```
Create a `.env` file in your project root to manage API keys locally
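If you keep keys in a `.env` file, one way to load them before creating agents is `python-dotenv` (an assumption here, not a HUD dependency; any loader works):
```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory
assert os.getenv("HUD_API_KEY"), "HUD_API_KEY is missing from the environment/.env"
```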
## Your First Agent
Run an agent on the 2048 game environment:
```python
import asyncio, os
import hud
from hud.datasets import Task
from hud.agents import ClaudeAgent

async def main():
    # The trace context captures ALL agent interactions within a "task run"
    # Everything inside this context shows up as one trace on app.hud.so
    with hud.trace("quickstart-2048"):
        # Define task with remote MCP environment
        task = Task(
            prompt="Win a game of 2048 by reaching the 128 tile",
            mcp_config={
                "hud": {
                    "url": "https://mcp.hud.so/v3/mcp",
                    "headers": {
                        "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                        "Mcp-Image": "hudpython/hud-text-2048:v1.2"
                    }
                }
            },
            setup_tool={"name": "setup", "arguments": {"name": "board", "arguments": {"board_size": 4}}},
            evaluate_tool={"name": "evaluate", "arguments": {"name": "max_number", "arguments": {"target": 64}}}
        )
        # Run agent (auto-creates MCP client from task.mcp_config)
        agent = ClaudeAgent()
        result = await agent.run(task)
        print(f"Max tile reached: {result.reward}")

asyncio.run(main())
```
The trace context ensures that the entire task run - from setup through evaluation - appears as one coherent trace on the platform
## What just happened?
1. **Task Definition**: We created a `Task` with:
* A prompt telling the agent what to do
* An `mcp_config` pointing to a remote MCP environment
* Setup and evaluation tools to initialize and score the game
2. **Auto Client**: The agent automatically created an MCP client from `task.mcp_config`
3. **Telemetry**: The `trace` context captured all interactions for debugging
4. **Evaluation**: The `max_number` evaluator returned the highest tile reached as the reward
## Next Steps
Learn how agents connect to environments
Test on SheetBench and OSWorld
Create your own MCP environment
## CLI Quick Reference
```bash
# Analyze any MCP environment
hud analyze hudpython/hud-text-2048:v1.2
# Debug environment initialization
hud debug hudpython/hud-text-2048:v1.2
# Start hot-reload development server
hud dev . --build
# Run HUD tools as MCP server (for Cursor/Claude)
hud mcp
```
See the [CLI Reference](/reference/cli/overview) for detailed command documentation
# Agents
Source: https://docs.hud.so/reference/agents
SDK reference for HUD agent classes
The HUD SDK provides a base `MCPAgent` class and several pre-built agent implementations for interacting with MCP environments.
## Base Class
### MCPAgent
```python
from hud.agents import MCPAgent
```
Abstract base class for all MCP-enabled agents. Handles tool discovery, filtering, and execution flow.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------------------------- | ---------------- | ------------------------------------------ | -------------- |
| `mcp_client` | `AgentMCPClient` | MCP client for server connections | `None` |
| `allowed_tools` | `list[str]` | List of tool names to allow | `None` (all) |
| `disallowed_tools` | `list[str]` | List of tool names to disallow | `[]` |
| `lifecycle_tools` | `list[str]` | Tools hidden from agent (setup/evaluate) | `[]` |
| `initial_screenshot` | `bool` | Capture screenshot before first prompt | `False` |
| `system_prompt` | `str` | System prompt to use | Default prompt |
| `append_tool_system_prompt` | `bool` | Append available tools to system | `False` |
| `append_setup_output` | `bool` | Append setup tool output to initial prompt | `False` |
| `model_name` | `str` | Model name for telemetry | `"mcp-agent"` |
| `response_agent` | `ResponseAgent` | Optional auto-response handler | `None` |
| `auto_trace` | `bool` | Enable automatic telemetry | `True` |
**Key Methods:**
```python
async def initialize(task: str | Task | None = None) -> None
"""Initialize agent with task-specific configuration."""
async def run(prompt_or_task: str | Task | dict[str, Any], max_steps: int = 10) -> Trace
"""Run agent with prompt or task. Returns Trace with results."""
async def call_tools(tool_call: MCPToolCall | list[MCPToolCall]) -> list[MCPToolResult]
"""Execute tool calls through MCP client."""
def get_available_tools() -> list[types.Tool]
"""Get filtered list of available tools (excludes lifecycle)."""
def get_tool_schemas() -> list[dict]
"""Get tool schemas formatted for the model."""
```
**Abstract Methods (must implement):**
```python
async def get_system_messages() -> list[Any]
"""Get system prompt formatted for the model."""
async def get_response(messages: list[Any]) -> AgentResponse
"""Get model response including tool calls."""
async def format_blocks(blocks: list[ContentBlock]) -> list[Any]
"""Format content blocks into model messages."""
async def format_tool_results(tool_calls: list[MCPToolCall],
tool_results: list[MCPToolResult]) -> list[Any]
"""Format tool results for the model."""
```
**Class Variables:**
* `metadata: dict[str, Any]` - Metadata injected into MCP initialize request
**Auto-Client Creation:**
If no `mcp_client` is provided but a `Task` with `mcp_config` is passed to `run()`, an MCPClient is automatically created and cleaned up.
## Pre-built Agents
### ClaudeAgent
```python
from hud.agents import ClaudeAgent
```
Claude-specific implementation using Anthropic's API.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------------- | ---------------- | ----------------------- | ------------------------------ |
| `model_client` | `AsyncAnthropic` | Anthropic client | Auto-created |
| `model` | `str` | Claude model to use | `"claude-3-7-sonnet-20250219"` |
| `max_tokens` | `int` | Maximum response tokens | `4096` |
| `use_computer_beta` | `bool` | Use computer-use beta | `True` |
**Features:**
* Native Claude tool calling
* Automatic prompt caching
* Computer-use beta support
* Display metadata injection (1280x720)
**Example:**
```python
agent = ClaudeAgent(
model="claude-3-5-sonnet-20241022",
max_tokens=8192
)
result = await agent.run(
Task(
prompt="Navigate to example.com",
mcp_config={"server": {...}},
evaluate_tool={"name": "evaluate", "arguments": {...}}
)
)
```
### OperatorAgent
```python
from hud.agents import OperatorAgent
```
OpenAI's Operator agent implementation.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| -------------- | ------------- | ----------------------- | --------------------------- |
| `model_client` | `AsyncOpenAI` | OpenAI client | Auto-created |
| `model` | `str` | Model to use | `"gpt-4o-realtime-preview"` |
| `max_tokens` | `int` | Maximum response tokens | `4096` |
**Features:**
* OpenAI function calling
* Operator system prompt
* Display metadata injection (1920x1080)
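**Example** (a minimal sketch; `allowed_tools=["openai_computer"]` mirrors the built-in agents example earlier in these docs, and the other parameters come from the table above):
```python
from hud.agents import OperatorAgent

agent = OperatorAgent(
    allowed_tools=["openai_computer"],
    max_tokens=4096,
)
# Run against a Task, as in the usage examples below (inside an async context)
result = await agent.run(task)
```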
### GenericOpenAIChatAgent
```python
from hud.agents import GenericOpenAIChatAgent
```
Generic OpenAI chat completion agent for any OpenAI-compatible API.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| -------------- | ------------- | ------------------------ | -------- |
| `model_client` | `AsyncOpenAI` | OpenAI-compatible client | Required |
| `model` | `str` | Model name | Required |
| `max_tokens` | `int` | Maximum response tokens | `4096` |
**Example:**
```python
# Use with local LLM
from openai import AsyncOpenAI
client = AsyncOpenAI(
base_url="http://localhost:11434/v1", # Ollama
api_key="not-needed"
)
agent = GenericOpenAIChatAgent(
model_client=client,
model="llama3.1"
)
```
### LangChainAgent
```python
from hud.agents import LangChainAgent
```
LangChain integration for using any LangChain-compatible model.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------- | --------------- | -------------------- | -------- |
| `model` | `BaseChatModel` | LangChain chat model | Required |
**Example:**
```python
from langchain_anthropic import ChatAnthropic
model = ChatAnthropic(model="claude-3-opus-20240229")
agent = LangChainAgent(model=model)
```
### ArtHUDAgent
```python
from hud.agents import ArtHUDAgent
```
Integration with ART (Automatic Reasoning and Tool-use) models.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------ | ---------- | ----------------------- | -------- |
| `art_model` | `ArtModel` | ART model instance | Required |
| `max_tokens` | `int` | Maximum response tokens | `4096` |
**Features:**
* Built-in Chain-of-Thought reasoning
* Structured reasoning traces
* Tool use optimization
## Helper Classes
### ResponseAgent
Base class for auto-response handlers that decide when to continue or stop.
```python
from hud.agents.misc import ResponseAgent
class MyResponseAgent(ResponseAgent):
    async def determine_response(self, agent_output: str) -> str:
        if "task complete" in agent_output.lower():
            return "STOP"
        return "Continue with the next step"
```
## Common Types
### AgentResponse
```python
from hud.types import AgentResponse
@dataclass
class AgentResponse:
    content: str | None = None
    tool_calls: list[MCPToolCall] | None = None
    done: bool = False
```
### MCPToolCall
```python
from hud.types import MCPToolCall
class MCPToolCall(BaseModel):
    name: str
    arguments: dict[str, Any] = {}
```
### MCPToolResult
```python
from hud.types import MCPToolResult
class MCPToolResult(BaseModel):
    content: list[ContentBlock]
    structuredContent: dict[str, Any] | None = None
    isError: bool = False
```
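To see how these types fit together, here is a minimal sketch of issuing a tool call through an initialized agent, using the `call_tools` signature from the Key Methods section above (the tool name is illustrative):
```python
from hud.types import MCPToolCall

# Assumes `agent` was created and initialized as in the usage examples below
call = MCPToolCall(name="move", arguments={"direction": "left"})
results = await agent.call_tools(call)  # returns list[MCPToolResult]

first = results[0]
if not first.isError:
    for block in first.content:
        print(block)
```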
### Trace
```python
from hud.types import Trace
class Trace(BaseModel):
    reward: float = 0.0
    done: bool = True
    content: str | None = None
    isError: bool = False
    info: dict[str, Any] = {}
    steps: list[TraceStep] = []
```
## Usage Examples
### Simple Prompt Execution
```python
from hud.agents import ClaudeAgent
from hud.clients import MCPClient
# Manual client creation
client = MCPClient({
"server": {
"command": "docker",
"args": ["run", "-i", "my-env:latest"]
}
})
agent = ClaudeAgent(mcp_client=client)
await agent.initialize()
# Run with string prompt
result = await agent.run("Click the submit button", max_steps=5)
print(f"Result: {result.content}")
print(f"Success: {not result.isError}")
```
### Task Execution with Auto-Client
```python
from hud.agents import OperatorAgent
from hud.datasets import Task
# No client needed - auto-created from task
agent = OperatorAgent()
task = Task(
prompt="Find the price of the product",
mcp_config={
"browser": {
"url": "mcp://mcp.hud.so/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-browser:latest"
}
}
},
setup_tool={
"name": "setup",
"arguments": {"url": "https://example.com"}
},
evaluate_tool={
"name": "evaluate",
"arguments": {"check": "price_found"}
}
)
# Client created automatically
result = await agent.run(task, max_steps=20)
print(f"Reward: {result.reward}")
```
### Custom Agent Implementation
```python
from hud.agents import MCPAgent
from hud.types import AgentResponse, MCPToolCall
from mcp.types import ContentBlock
import hud

class MyCustomAgent(MCPAgent):
    metadata = {"custom": "metadata"}

    async def get_system_messages(self) -> list[dict]:
        return [{
            "role": "system",
            "content": self.system_prompt
        }]

    @hud.instrument(span_type="agent", record_args=False, record_result=True)
    async def get_response(self, messages: list[dict]) -> AgentResponse:
        # Your LLM call here
        response = await self.llm.chat(messages)
        return AgentResponse(
            content=response.content,
            tool_calls=[
                MCPToolCall(name=tc.name, arguments=tc.args)
                for tc in response.tool_calls
            ],
            done=response.stop_reason == "stop"
        )

    async def format_blocks(self, blocks: list[ContentBlock]) -> list[dict]:
        content = []
        for block in blocks:
            if block.type == "text":
                content.append({"type": "text", "text": block.text})
            elif block.type == "image":
                content.append({
                    "type": "image",
                    "image": {"data": block.data, "format": "png"}
                })
        return [{"role": "user", "content": content}]

    async def format_tool_results(self, tool_calls, tool_results) -> list[dict]:
        return [{
            "role": "tool",
            "content": result.content,
            "tool_call_id": call.name
        } for call, result in zip(tool_calls, tool_results)]
```
## See Also
* [Create Agents](/evaluate-agents/create-agents) - Tutorial on building agents
* [Tasks](/reference/tasks) - Task configuration reference
* [Architecture](/core-concepts/architecture) - How agents fit in HUD
# hud analyze
Source: https://docs.hud.so/reference/cli/analyze
Inspect MCP environments quickly from metadata or live containers
The `hud analyze` command inspects MCP environments to discover their tools and capabilities. By default, it uses cached metadata for instant results. Use `--live` for real-time analysis.
## Usage
```bash
hud analyze [OPTIONS]
```
## Arguments
Docker image, lock file path, or command to analyze
## Options
Output format: `interactive`, `json`, or `markdown`. Short: `-f`
Show full tool schemas and parameters. Short: `-v`
Run container for live analysis (slower but more accurate)
JSON config file with MCP configuration. Short: `-c`
Analyze a server from Cursor config
Analyze a local command instead of Docker image
Timeout in seconds for live analysis
## Analysis Modes
### Fast Mode (Default)
Uses cached metadata from:
1. Local lock file cache (`~/.hud/locks/`)
2. HUD registry (if available)
3. Basic Docker manifest info
```bash
# Instant results from metadata
hud analyze hudpython/text-analyzer:latest
# From lock file
hud analyze ./my-env.lock.yaml
```
Fast mode is perfect for quick inspection and doesn't require running containers.
### Live Mode
Runs the actual container for comprehensive analysis:
```bash
# Full analysis with running container
hud analyze hudpython/text-analyzer:latest --live
# With environment variables
hud analyze my-env:latest --live -e API_KEY=test
```
Use `--live` when you need:
* Real-time tool validation
* Resource discovery
* Telemetry information
* Testing with specific env vars
## Output Formats
### Interactive (Default)
Fast mode output:
```
📊 Environment Overview
┌─────────────┬─────────────────────────┐
│ Image │ hudpython/text-2048 │
│ Source │ HUD Registry │
│ Built │ 2024-01-15T10:30:00Z │
│ HUD Version │ 0.1.0 │
│ Init Time │ 450 ms │
│ Tools │ 6 │
└─────────────┴─────────────────────────┘
🔧 Available Tools
└── Tools
├── setup - Initialize environment
├── evaluate - Return environment state
├── move - Move tiles in direction
├── reset_board - Reset game to start
└── get_score - Get current score
```
Live mode output (`--live`):
```
🔍 Analyzing MCP environment: hudpython/text-2048:latest
📊 Environment Overview
┌─────────────┬─────────────────────────┐
│ Server │ hud-text-2048 │
│ Initialized │ ✓ │
└─────────────┴─────────────────────────┘
🔧 Available Tools
├── Regular Tools
│ ├── move
│ │ └── Move tiles in the specified direction
│ ├── reset_board
│ │ └── Reset the game board to initial state
│ └── get_score
│ └── Get the current game score
└── Hub Tools
└── game_hub
├── save_state
└── load_state
📚 Available Resources
┌──────────────────┬────────────────┬─────────────┐
│ URI │ Name │ Type │
├──────────────────┼────────────────┼─────────────┤
│ game://state │ Game State │ application │
│ game://history │ Move History │ text/plain │
└──────────────────┴────────────────┴─────────────┘
📡 Telemetry Data
┌─────────┬─────────────────────────────┐
│ Live URL│ https://app.hud.so/xyz123 │
│ Status │ running │
│ Services│ 2/2 running │
└─────────┴─────────────────────────────┘
```
### JSON Format
```bash
# Fast mode
hud analyze my-env --format json
{
"image": "my-env:latest",
"status": "from_cache",
"tool_count": 6,
"init_time": 450,
"tools": [{
"name": "setup",
"description": "Initialize environment"
}]
}
# Live mode
hud analyze my-env --format json --live
{
"metadata": {
"servers": ["my-env"],
"initialized": true
},
"tools": [{
"name": "setup",
"description": "Initialize environment",
"input_schema": {...}
}],
"hub_tools": {...},
"resources": [...],
"telemetry": {...}
}
```
### Markdown Format
```bash
hud analyze my-env --format markdown > docs/tools.md
```
Generates formatted documentation with tool descriptions and schemas.
## Examples
### Fast Analysis (Default)
```bash
# Quick inspection from metadata
hud analyze hudpython/text-analyzer:latest
# Analyze multiple environments rapidly
for env in browser scraper analyzer; do
hud analyze "hudpython/$env:latest"
done
# From lock file
hud analyze ./environments/prod.lock.yaml
# With verbose schemas
hud analyze my-env:latest --verbose
```
### Live Analysis
```bash
# Full container analysis
hud analyze my-env:latest --live
# With environment variables
hud analyze my-env:latest --live -e API_KEY=test -e DEBUG=true
# Custom timeout for slow environments
hud analyze heavy-env:latest --live --timeout 60
```
### Local Development
```bash
# Analyze local command
hud analyze --command "python my_server.py" --live
# Cursor integration
hud analyze --cursor my-dev-server --live
# With config file
hud analyze my-env:latest --config mcp-config.json
```
### Documentation Generation
```bash
# Generate tool documentation
hud analyze my-env:latest --format markdown > docs/tools.md
# Compare versions
hud analyze my-env:v1.0 --format json > v1.json
hud analyze my-env:v2.0 --format json > v2.json
diff v1.json v2.json
# Extract tool names
hud analyze my-env --format json | jq -r '.tools[].name'
```
## Performance Comparison
| Mode | Speed | Use Case |
| --------------- | ------------ | --------------------------- |
| Fast (default) | \< 1 second | Quick inspection, CI checks |
| Live (`--live`) | 5-30 seconds | Full validation, debugging |
Use fast mode for rapid iteration during development. Switch to `--live` for final validation.
## CI/CD Integration
### Fast CI Checks
```yaml
# GitHub Actions - Quick tool validation
- name: Verify tools exist
  run: |
    TOOLS=$(hud analyze $IMAGE --format json | jq '.tool_count')
    if [ "$TOOLS" -lt 3 ]; then
      echo "Not enough tools!"
      exit 1
    fi
```
### Python Validation Script
```python
#!/usr/bin/env python3
import subprocess
import json

def verify_environment(image, required_tools):
    """Fast validation using metadata"""
    # Quick check with metadata
    result = subprocess.run(
        ["hud", "analyze", image, "--format", "json"],
        capture_output=True, text=True, check=True
    )
    data = json.loads(result.stdout)
    available = {tool["name"] for tool in data["tools"]}
    missing = set(required_tools) - available
    if missing:
        print(f"⚠️ Missing tools: {missing}")
        print("Running live analysis for details...")
        # Full check if needed
        subprocess.run(
            ["hud", "analyze", image, "--live"],
            check=True
        )
    return len(missing) == 0

# Usage
verify_environment("my-env:latest", ["setup", "evaluate", "process"])
```
## Troubleshooting
The environment may not have been pushed to HUD registry:
```bash
# Falls back to basic Docker info
hud analyze my-env:latest --verbose
# Force live analysis
hud analyze my-env:latest --live
```
If metadata seems outdated:
```bash
# Use live mode for current state
hud analyze my-env:latest --live
# Rebuild and push to update metadata
hud build && hud push
```
Some environments take longer to initialize:
```bash
# Increase timeout
hud analyze heavy-env:latest --live --timeout 120
# Check with debug first
hud debug heavy-env:latest --max-phase 1
```
Ensure clean output:
```bash
# Redirect stderr for clean JSON
hud analyze my-env --format json 2>/dev/null
# Or use jq for validation
hud analyze my-env --format json | jq .
```
## Best Practices
1. **Default to Fast Mode**: Start with metadata for quick checks
2. **Live for Validation**: Use `--live` before production deployments
3. **Cache Lock Files**: Share lock files for consistent metadata
4. **Version Your Tools**: Track tool changes across versions
5. **Automate Checks**: Add fast analysis to CI pipelines
Fast mode shows what tools *should* be available based on build-time analysis. Live mode shows what *actually* works right now.
## See Also
* [`hud debug`](/reference/cli/debug) - Test tool functionality
* [`hud pull`](/reference/cli/pull) - Get environments with metadata
* [`hud build`](/reference/cli/build) - Generate fresh metadata
* [`hud run`](/reference/cli/run) - Execute environments
# hud build
Source: https://docs.hud.so/reference/cli/build
Build production images and generate lock files for reproducible environments
The `hud build` command creates production Docker images and generates `hud.lock.yaml` files that capture all metadata needed for reproducible deployments.
## Usage
```bash
hud build [DIRECTORY] [OPTIONS]
```
## Arguments
`DIRECTORY` - Environment directory containing Dockerfile and pyproject.toml
## Options
* `--tag`, `-t` - Docker image tag. Auto-generates based on directory name if not specified
* `--no-cache` - Build without using Docker cache
* `--verbose`, `-v` - Show detailed build output
## What It Does
The build command performs several critical steps:
1. **Build the image**: Runs `docker build` with appropriate tags and build arguments
2. **Analyze the environment**: Connects to the built image and analyzes:
   * Initialization time
   * Available tools and descriptions
   * Required/optional environment variables
   * Protocol capabilities
3. **Generate the lock file**: Creates `hud.lock.yaml` with:
   * Image digest (sha256 hash)
   * Build metadata
   * Tool summaries
   * Environment requirements
4. **Label the image**: Adds metadata labels to the Docker image for tracking
## Lock File Structure
The generated `hud.lock.yaml` contains:
```yaml
version: "1.0"
image: "my-env:latest@sha256:abc123..."
build:
generatedAt: "2024-01-15T10:30:00Z"
hudVersion: "0.1.0"
directory: "my-env"
environment:
initializeMs: 450
toolCount: 12
variables:
required: ["API_KEY", "DATABASE_URL"]
optional: ["DEBUG", "LOG_LEVEL"]
tools:
- name: "setup"
description: "Initialize environment"
- name: "evaluate"
description: "Return environment state"
- name: "process_data"
description: "Process input data with advanced algorithms..."
```
## Examples
### Basic Build
Build with auto-generated tag:
```bash
hud build
# Creates: hud-my-env:latest
```
### Custom Tag
Specify version tag:
```bash
hud build . --tag v1.2.0
# Creates: hud-my-env:v1.2.0
```
### Clean Build
Force rebuild without cache:
```bash
hud build . --no-cache --verbose
```
### CI/CD Build
Build with specific naming:
```bash
hud build . --tag "prod-$(git rev-parse --short HEAD)"
# Creates: hud-my-env:prod-abc123f
```
## Image Naming
If no tag is specified via `--tag`, the command:
1. Checks `pyproject.toml` for `[tool.hud.image]`
2. Auto-generates: `hud-{directory-name}:latest`
3. Converts underscores to hyphens for Docker compatibility
Examples:
* Directory `my_tool` → Image `hud-my-tool:latest`
* Directory `browser-env` → Image `hud-browser-env:latest`
## Build Process
### Dockerfile Detection
The command expects a `Dockerfile` in the environment directory:
```dockerfile
FROM python:3.10-slim
RUN pip install uv
WORKDIR /app
COPY pyproject.toml .
RUN uv pip install --system -e .
COPY src/ src/
CMD ["python", "-m", "my_package.server"]
```
### Environment Variables
Automatically extracts from Dockerfile:
* `ENV` declarations → Optional variables
* `ARG` with no default → Required variables
```dockerfile
# Detected as required
ARG API_KEY
# Detected as optional
ENV DEBUG=false
ENV LOG_LEVEL=info
```
### Build Arguments
Passes through Docker build arguments:
```bash
hud build . --build-arg API_VERSION=v2
```
## Lock File Usage
The generated `hud.lock.yaml` enables:
* Exact image reference with digest ensures identical deployment
* Tool information available without running container
* Metadata for `hud push` to share environments
* Version tracking for debugging and updates
## Integration
### With hud push
After building, share your environment:
```bash
hud build . --tag v1.0.0
hud push
```
### With CI/CD
```yaml
# GitHub Actions example
- name: Build HUD environment
  run: |
    hud build . --tag "${{ github.sha }}"
- name: Verify build
  run: |
    hud debug "hud-my-env:${{ github.sha }}"
- name: Push to registry
  run: |
    hud push
```
### Version Management
```bash
# Semantic versioning
hud build . --tag v1.0.0
hud build . --tag v1.1.0-beta
# Date-based
hud build . --tag "$(date +%Y%m%d)"
# Git-based
hud build . --tag "$(git describe --tags --always)"
```
## Troubleshooting
### Build Failures
Check Dockerfile syntax and ensure all files are present:
```bash
docker build . --no-cache
```
The environment may be taking too long to initialize:
```bash
hud build . --verbose # See detailed logs
```
Ensure the environment implements required tools:
* `setup()` - Must exist
* `evaluate()` - Must return a dictionary
### Performance Tips
Structure your Dockerfile to maximize layer caching:
1. Install system dependencies first
2. Copy dependency files (pyproject.toml)
3. Install Python dependencies
4. Copy source code last
Using `--no-cache` significantly increases build time. Only use when necessary.
## Best Practices
1. **Tag Consistently**: Use semantic versioning or git-based tags
2. **Cache Wisely**: Structure Dockerfile for optimal caching
3. **Test Builds**: Always run `hud debug` after building
4. **Document Variables**: Clearly indicate required vs optional env vars
5. **Keep Images Small**: Use slim base images and multi-stage builds
## See Also
* [`hud init`](/reference/cli/init) - Create new environments
* [`hud dev`](/reference/cli/dev) - Development with hot-reload
* [`hud push`](/reference/cli/push) - Share to registry
* [`hud analyze`](/reference/cli/analyze) - Inspect built images
* [Build Environments](/build-environments) - Complete guide
# hud debug
Source: https://docs.hud.so/reference/cli/debug
Test MCP environments through 5 validation phases
The `hud debug` command validates MCP environments through 5 progressive phases.
## Synopsis
```bash
hud debug [OPTIONS] IMAGE [DOCKER_ARGS...]
hud debug --command COMMAND [ARGS...]
hud debug --cursor SERVER_NAME
```
## Options
| Option | Description | Default |
| ------------- | --------------------------------- | ------- |
| `--max-phase` | Stop at specified phase (1-5) | 5 |
| `--command` | Debug a command instead of Docker | - |
| `--cursor` | Debug a Cursor MCP server | - |
| `--timeout` | Timeout per phase in seconds | 30 |
| `-e, --env` | Set environment variables | - |
## Debug Phases
### Phase 1: Container Startup
```
🚀 Phase 1: Container startup
✓ Container started successfully
✓ Received output on stderr
```
Checks: Process starts, outputs to stderr, doesn't exit
### Phase 2: MCP Initialization
```
🔗 Phase 2: MCP initialization
✓ Received valid initialize response
✓ Protocol version: 1.0
```
Checks: JSON-RPC communication, protocol compatibility
### Phase 3: Tool Discovery
```
🔧 Phase 3: Tool discovery
✓ Found 8 tools:
• setup_board - Initialize game board
• move - Move in direction
```
Checks: Tools registered, valid schemas
### Phase 4: Tool Execution
```
🧪 Phase 4: Tool testing
→ Testing tool: setup_board
✓ Tool executed successfully
```
Checks: Tools callable, return valid responses
### Phase 5: Readiness Check
```
✅ Phase 5: Readiness check
✓ Ready for agent interaction
```
Checks: Overall health, state persistence
Once your environment passes Phase 5, consider using `hud dev` for hot-reload development. This eliminates rebuild cycles and speeds up iteration.
## Examples
### Docker Debug
```bash
# Basic
hud debug my-server:latest
# With environment variables
hud debug my-server:latest -e DEBUG=true -e LOG_LEVEL=debug
# Stop at phase 3
hud debug my-server:latest --max-phase 3
```
### Local Development
```bash
# Python script
hud debug --command "python my_server.py"
# With arguments
hud debug --command "python -m my_module.server --port 8000"
```
### Cursor Integration
```bash
# Debug Cursor server
hud debug --cursor my-dev-env
```
## Common Issues
### Phase 1 Failures
* **Container exits**: Check CMD/ENTRYPOINT
* **No stderr**: Add `print(..., file=sys.stderr)`
* **Permission denied**: Check file permissions
### Phase 2 Failures
* **Invalid JSON**: Only JSON-RPC on stdout
* **Timeout**: Check for blocking operations
### Phase 3 Failures
* **No tools**: Check `@server.tool()` decorators
* **Invalid schema**: Add type hints and docstrings
## Advanced Usage
### Incremental Debugging
```bash
# Debug phase by phase
for phase in 1 2 3 4 5; do
echo "Testing phase $phase..."
hud debug my-env:latest --max-phase $phase || break
done
```
### Development Mode
```bash
# Once debug passes, use hot-reload for development
cd environments/my-env
hud dev . --build
# Make changes to src/ files
# Test changes without rebuilding
hud analyze hud-my-env:dev
```
### Parallel Testing
```bash
# Test multiple environments
images=("env1:latest" "env2:latest")
for image in "${images[@]}"; do
hud debug "$image" > "debug_${image//:/}.log" 2>&1 &
done
wait
```
## Best Practices
1. Start with phase 1 and work up
2. Use `--max-phase` for incremental debugging
3. Check stderr output first
4. Keep tools simple for easier debugging
5. Add health check tools
If debug passes, agents should work reliably with your environment.
## Next Step
Explore environment capabilities after debugging
# hud dev
Source: https://docs.hud.so/reference/cli/dev
Hot-reload development server for MCP environments
The `hud dev` command provides hot-reload development for MCP environments, automatically restarting your server when source files change.
## Synopsis
```bash
hud dev [DIRECTORY] [OPTIONS]
```
## Description
`hud dev` creates a development proxy server that runs your MCP environment in Docker with hot-reload capabilities. It automatically detects or builds your Docker image and exposes it via HTTP or stdio for development.
**Key Features:**
* **Auto-detection**: Automatically finds or generates Docker image names
* **Hot-reload**: Mounts local `src/` directory for live code changes
* **Interactive testing**: Test tools directly from the command line with arrow keys
* **HTTP/Stdio protocols**: Choose transport method
* **Inspector support**: Launch MCP Inspector for debugging
* **Container management**: Handles Docker lifecycle automatically
## Arguments
| Argument | Description | Default |
| ----------- | -------------------------------- | ------------- |
| `DIRECTORY` | Environment directory (optional) | `.` (current) |
## Options
| Option | Description | Default |
| ------------------- | -------------------------------------------- | ------------- |
| `--image`, `-i` | Docker image name (overrides auto-detection) | Auto-detected |
| `--build`, `-b` | Build image before starting | `false` |
| `--no-cache` | Force rebuild without cache | `false` |
| `--transport`, `-t` | Transport protocol: `http` or `stdio` | `http` |
| `--port`, `-p` | HTTP server port (ignored for stdio) | `8765` |
| `--interactive` | Launch interactive tool testing interface | `false` |
| `--no-reload` | Disable hot-reload functionality | `false` |
| `--verbose`, `-v` | Show detailed server logs | `false` |
| `--inspector` | Launch MCP Inspector (HTTP mode only) | `false` |
| `--no-logs` | Disable streaming Docker logs | `false` |
## Examples
### Auto-Detection Mode (Recommended)
Start development server without specifying directory (auto-detects from current location):
```bash
hud dev
```
### Build and Start
Build Docker image first, then start development server:
```bash
hud dev --build
```
### Specific Directory
Watch a specific environment directory:
```bash
hud dev environments/my-browser-env --build
```
### Custom Image
Use a specific Docker image name:
```bash
hud dev . --image my-custom-env:dev --build
```
### HTTP Mode with Inspector
Start in HTTP mode and launch MCP Inspector:
```bash
hud dev . --build --transport http --inspector
```
### Stdio Mode
Use stdio transport (for multiple simultaneous connections):
```bash
hud dev . --build --transport stdio
```
### Clean Rebuild
Force a complete rebuild without Docker cache:
```bash
hud dev . --build --no-cache
```
### Verbose Logging
Show detailed logs from the container:
```bash
hud dev . --build --verbose
```
### Interactive Testing
Launch with interactive tool testing interface:
```bash
hud dev . --build --interactive
```
Interactive mode disables Docker logs and hot-reload to provide a clean interface for testing tools with arrow key navigation.
## How It Works
1. **Image Resolution**:
* Checks `pyproject.toml` for `[tool.hud.image]` setting
* Auto-generates name like `hud-{directory-name}:dev` if not found
* Uses `--image` override if provided
2. **Docker Setup**:
* Builds image with `--build` if requested
* Mounts local `src/` directory to `/app/src` in container
* Injects hot-reload supervisor (unless `--no-reload`)
3. **Proxy Server**:
* Creates FastMCP proxy that forwards to Docker containers
* Each client connection gets its own container instance
* Automatically handles container lifecycle
4. **Development Flow**:
* Edit files in `src/` directory
* Changes are immediately reflected in running containers
* No rebuild required for code changes
* **Only the MCP server restarts**, not your environment state
### Process Separation Architecture
For stateful environments, `hud dev` supports a critical design pattern: separating the MCP server from the environment process. This separation enables hot-reload without losing state.
**Why Separation Matters:**
* MCP server can restart instantly for code changes
* Environment state (browsers, databases, games) persists
* Development is faster without constant state resets
**Example Architecture:**
```
┌─────────────────┐ ┌──────────────────────┐
│ MCP Server │────▶│ Environment Process │
│ (Restartable) │ │ (Persistent) │
│ │ │ │
│ - Lightweight │ │ - Maintains state │
│ - Fast startup │ │ - Long-lived │
│ - MCP protocol │ │ - Resource heavy │
└─────────────────┘ └──────────────────────┘
▲ │
│ │
└─── Unix Socket ─────────┘
TCP, gRPC, etc.
```
**Connection Methods:**
* Unix sockets (recommended for local dev)
* TCP/HTTP endpoints
* gRPC services
* Shared memory/IPC
See the browser environment for a complete implementation example.
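A minimal sketch of this separation, reusing the `hud.server.context` helpers shown in the `hud init` template later in this reference; the socket path and class names are illustrative:
```python
# environment.py - long-lived process that owns the state
import asyncio
from hud.server.context import run_context_server

class BrowserState:
    def __init__(self):
        self.page_count = 0

    def open_page(self):
        self.page_count += 1
        return self.page_count

if __name__ == "__main__":
    # Serves the state object over a local socket and keeps running across MCP restarts
    asyncio.run(run_context_server(BrowserState()))
```
```python
# server.py - restartable MCP server that re-attaches after every hot-reload
from hud.server import MCPServer
from hud.server.context import attach_context

mcp = MCPServer(name="my-env")
ctx = None

@mcp.initialize
async def init(init_ctx):
    global ctx
    ctx = attach_context("/tmp/hud_ctx.sock")  # reconnect to the persistent process

@mcp.tool()
async def open_page() -> str:
    return f"Opened page #{ctx.open_page()}"

if __name__ == "__main__":
    mcp.run()
```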
## File Mounting
The development server mounts your local `src/` directory directly into containers:
```
Local: ./src/ → Container: /app/src/
Local: ./src/my_package/ → Container: /app/src/my_package/
Local: ./src/server.py → Container: /app/src/server.py
```
**Live Changes**: Any edits to files in `src/` are immediately reflected in running containers.
**Hot-Reload Behavior**:
* With process separation: Only the MCP server restarts, environment state persists
* Without separation: Entire environment restarts (state is lost)
* Disable with `--no-reload` if manual control is preferred
For environments with important state (browsers, databases, running applications), implement process separation to preserve state during hot-reload. This dramatically improves development speed.
## Interactive Mode
When launched with `--interactive`, `hud dev` provides a command-line interface for testing tools:
```bash
hud dev . --interactive
```
**Features:**
* **Arrow Key Navigation**: Select tools using up/down arrows
* **Parameter Input**: Enter tool parameters with validation
* **Result Display**: See tool responses in real-time
* **Clean Interface**: Disables logs for focused testing
* **Exit Gracefully**: Select "Quit" or press Ctrl+C
**Workflow:**
1. Environment starts and connects
2. Tools list appears with arrow navigation
3. Select a tool and press Enter
4. Input parameters if required
5. See results immediately
6. Continue testing or quit
Interactive mode is perfect for rapid tool development - make changes to your code and test immediately without leaving the terminal.
## Transport Modes
### HTTP Transport (Default)
```bash
hud dev . --transport http --port 8080
```
**Benefits:**
* Web browser access
* MCP Inspector integration
* Multiple simultaneous connections
* RESTful debugging
**URL:** `http://localhost:8080/mcp`
### Stdio Transport
```bash
hud dev . --transport stdio
```
**Benefits:**
* Direct MCP protocol
* Lower latency
* Cursor/Claude integration ready
* Single connection focused
**Usage:** Connect via stdio to the running container
## Integration Examples
### Cursor Integration
1. Start development server:
```bash
hud dev . --build --transport http --port 8765
```
2. Add to Cursor's MCP config:
```json
{
  "mcpServers": {
    "my-dev-env": {
      "url": "http://localhost:8765/mcp"
    }
  }
}
```
3. Edit files - changes apply immediately!
### Testing During Development
Run analysis in another terminal:
```bash
# Test tools are working
hud analyze hud-my-env:dev
# Full debug check
hud debug hud-my-env:dev
```
### MCP Inspector
Launch inspector for visual debugging:
```bash
hud dev . --build --inspector
```
Opens browser with MCP Inspector interface showing:
* Available tools and schemas
* Real-time tool calls
* Protocol messages
* Error debugging
## Troubleshooting
### Port Already in Use
```bash
# Use different port
hud dev . --port 9000
# Or find and kill existing process
lsof -ti :8765 | xargs kill
```
### Build Failures
```bash
# Clean rebuild
hud dev . --build --no-cache
# Check Dockerfile
docker build -t my-env .
```
### File Changes Not Detected
```bash
# Check file permissions
ls -la src/
# Force restart (Ctrl+C then restart)
hud dev . --build --verbose
```
### Container Startup Issues
```bash
# Debug with full logs
hud dev . --build --verbose
# Test image manually
docker run -it my-env:dev
```
## Performance Tips
1. **Exclude Large Files**: Use `.dockerignore` to exclude large files from build context
2. **Layer Caching**: Structure Dockerfile for optimal layer caching
3. **Minimal Rebuilds**: Only use `--build` when dependencies change
4. **Port Selection**: Use uncommon ports to avoid conflicts
## Development Workflow
Recommended workflow for environment development:
```bash
# 1. Navigate to your environment directory
cd environments/my-env
# 2. Start development server with interactive testing
hud dev --build --interactive
# 3. Test tools interactively with arrow keys
# Make code changes and test again immediately
# 4. Alternative: Run dev server and test separately
# Terminal 1: Keep hud dev running
hud dev --build
# Terminal 2: Test your changes
hud analyze hud-my-env:dev
# 5. Production build
hud build # Creates hud.lock.yaml
# 6. Share your environment
hud push
```
Using `hud dev --build` also generates or updates `hud.lock.yaml` with environment metadata, similar to `hud build`.
## See Also
* [`hud init`](/reference/cli/init) - Create new environments
* [`hud build`](/reference/cli/build) - Build production images
* [`hud push`](/reference/cli/push) - Share to registry
* [`hud analyze`](/reference/cli/analyze) - Test tools after changes
* [`hud debug`](/reference/cli/debug) - Full environment validation
* [`hud run`](/reference/cli/run) - Run production environments
* [Build Environments](/build-environments) - Environment development guide
# hud init
Source: https://docs.hud.so/reference/cli/init
Create a new HUD environment with minimal boilerplate
The `hud init` command creates a new HUD environment from scratch with all the essential files needed to build, test, and deploy an MCP server.
## Usage
```bash
hud init [NAME] [OPTIONS]
```
## Arguments
`NAME` - Environment name. If not provided, uses the current directory name.
## Options
* `--dir`, `-d` - Target directory where the environment will be created
* `--force`, `-f` - Overwrite existing files if they exist
## What It Creates
The command generates a minimal but complete HUD environment structure:
```
my-env/
├── Dockerfile # Container configuration
├── pyproject.toml # Dependencies and metadata
├── README.md # Basic documentation
└── src/
└── hud_controller/
├── context.py # Persistent state management
└── server.py # MCP server implementation
```
### Generated Files
Minimal MCP server with required tools:
```python
from hud.server import MCPServer
from hud.server.context import attach_context

mcp = MCPServer(name="my-env")
ctx = None

@mcp.initialize
async def init(init_ctx):
    global ctx
    ctx = attach_context("/tmp/hud_ctx.sock")

@mcp.tool()
async def act() -> str:
    """Perform an action."""
    return f"Action #{ctx.act()}"

@mcp.tool()
async def setup() -> str:
    """Required for HUD environments."""
    return "Ready"

@mcp.tool()
async def evaluate() -> dict:
    """Required for HUD environments."""
    return {"count": ctx.get_count()}

if __name__ == "__main__":
    mcp.run()
```
`src/hud_controller/context.py` - Persistent state that survives hot-reloads:
```python
from hud.server.context import run_context_server
import asyncio
class Context:
def __init__(self):
self.count = 0
def act(self):
self.count += 1
return self.count
def get_count(self):
return self.count
if __name__ == "__main__":
asyncio.run(run_context_server(Context()))
```
`Dockerfile` - Production-ready container setup:
```dockerfile
FROM python:3.10-slim
RUN pip install uv
WORKDIR /app
COPY pyproject.toml .
RUN uv pip install --system -e .
COPY src/ src/
CMD ["python", "-m", "hud_controller.server"]
```
`pyproject.toml` - Project configuration with HUD metadata:
```toml
[project]
name = "my-env"
version = "0.1.0"
dependencies = ["hud[sdk]"]
[tool.hud]
image = "hud-my-env:dev"
```
## Examples
### Basic Usage
Create an environment in a new directory:
```bash
hud init my-env
cd my-env
```
### Current Directory
Initialize in the current directory using its name:
```bash
mkdir awesome-tool && cd awesome-tool
hud init
```
### Custom Location
Create in a specific directory:
```bash
hud init my-env --dir ~/projects
cd ~/projects/my-env
```
### Overwrite Existing
Force overwrite existing files:
```bash
hud init my-env --force
```
## Next Steps
After creating your environment:
Run with hot-reload and interactive testing:
```bash
hud dev . --interactive
```
Edit `src/hud_controller/server.py` to add custom tools:
```python
@mcp.tool()
async def my_custom_tool(text: str) -> str:
"""Process some text."""
return text.upper()
```
In interactive mode, use the arrow keys to select and test tools.
When ready, build and share:
```bash
hud build
hud push
```
## Key Concepts
### Required Tools
Every HUD environment must implement:
* `setup()` - Initialize the environment
* `evaluate()` - Return environment state for evaluation
### Context Pattern
The context pattern separates state from server logic:
* State persists across hot-reloads
* Server reconnects to existing context
* Enables rapid development iteration
### Minimal Dependencies
The template uses only essential dependencies:
* `hud[sdk]` - Core HUD functionality
* No unnecessary packages
* Production-ready from the start
## Tips
The generated environment is immediately runnable with `hud dev`. No additional setup required!
Environment names are converted to valid Docker image names automatically (underscores → hyphens).
Use `--force` carefully as it will overwrite existing files without confirmation.
## See Also
* [hud dev](/reference/cli/dev) - Start development with hot-reload
* [hud build](/reference/cli/build) - Build production images
* [Build Environments Guide](/build-environments) - Complete environment development guide
# CLI Overview
Source: https://docs.hud.so/reference/cli/overview
Complete reference for HUD command-line tools
The HUD CLI provides a complete toolkit for creating, developing, and running MCP environments. Commands are organized into two main workflows:
**Directory-based commands** for creating and sharing environments:
* `hud init` - Create new environment
* `hud dev` - Develop with hot-reload
* `hud build` - Build and generate lock file
* `hud push` - Share to registry
**Target-based commands** for consuming environments:
* `hud pull` - Get from registry
* `hud analyze` - Inspect capabilities
* `hud debug` - Test compliance
* `hud run` - Execute locally/remotely
## Installation
```bash
# uv (recommended)
uv tool install hud-python
hud --version
```
```bash
# pip
pip install hud-python
python -m hud.cli --help
```
```bash
# pipx
pipx install hud-python
hud --version
```
## Commands
### Building Workflow
| Command | Input | Description | Example |
| ----------- | --------- | ----------------------- | ------------------------- |
| `hud init` | Directory | Create new environment | `hud init my-env` |
| `hud dev` | Directory | Hot-reload development | `hud dev . --interactive` |
| `hud build` | Directory | Build image & lock file | `hud build . --tag v1.0` |
| `hud push` | Directory | Share to registry | `hud push . --tag prod` |
### Running Workflow
| Command | Input | Description | Example |
| ------------- | ------ | ----------------------------- | ------------------------- |
| `hud pull` | Target | Get environment from registry | `hud pull org/env:latest` |
| `hud analyze` | Target | Inspect tools & capabilities | `hud analyze org/env` |
| `hud debug` | Target | Test MCP compliance | `hud debug my-env:latest` |
| `hud run` | Target | Execute server | `hud run my-env --local` |
### Utilities
| Command | Description | Example |
| ----------------- | --------------------------- | ----------------- |
| `hud quickstart` | Clone quickstart repository | `hud quickstart` |
| `hud cursor-list` | List Cursor MCP servers | `hud cursor-list` |
| `hud version` | Show version info | `hud version` |
## Complete Workflows
### Building an Environment
Create a new HUD environment with minimal boilerplate:
```bash
hud init my-env && cd my-env
```
Creates `Dockerfile`, `pyproject.toml`, and minimal MCP server with context.
Run with hot-reload and interactive testing:
```bash
hud dev
```
Your changes reload automatically. Test tools interactively with arrow keys.
Create production image and lock file:
```bash
hud build
```
Generates `hud.lock.yaml` with metadata and labels image for reproducibility.
Share to Docker and HUD registries:
```bash
hud push
```
Requires `HUD_API_KEY`. Auto-detects registry from Docker login.
### Running an Environment
Get environment with metadata preview:
```bash
hud pull hudpython/text_init:latest
```
Shows tools, env vars, and build info before downloading.
Quick inspection without running:
```bash
hud analyze hudpython/text_init # Fast (from metadata)
hud analyze hudpython/text_init --live # Full (runs container)
```
Test MCP protocol compliance:
```bash
hud debug hudpython/text_init:latest
```
Validates through 5 phases of initialization.
Execute in production mode:
```bash
hud run hudpython/text_init:latest # Remote (default)
hud run hudpython/text_init:latest --local # Local Docker
```
## Common Usage
### Docker Images
```bash
# Basic
hud debug my-image:latest
# With options
hud debug my-image:latest -e DEBUG=true -p 8080:8080
# From registry
hud analyze ghcr.io/org/image:v1.0.0
```
### Local Commands
```bash
# Python script
hud debug --command "python server.py"
# Node.js server
hud debug --command "node mcp-server.js"
```
### Cursor Integration
```bash
# Debug Cursor server
hud debug --cursor my-dev-server
# List all servers
hud cursor-list
```
## Output Formats
### Interactive (Default)
```bash
hud analyze my-env
🔍 Analyzing MCP environment: my-env
✓ Connected successfully
✓ Found 12 tools
Available Tools:
• click - Click at coordinates
• type - Type text
...
```
### JSON
```bash
hud analyze my-env --format json
{
"tools": [{
"name": "click",
"description": "Click at coordinates",
"parameters": {...}
}]
}
```
### Markdown
```bash
hud analyze my-env --format markdown > docs/tools.md
```
## CI/CD Example
```bash
#!/bin/bash
set -e
# Test environment
hud debug "$IMAGE_NAME"
# Verify tools
TOOLS=$(hud analyze "$IMAGE_NAME" --format json | jq '.tools | length')
if [ "$TOOLS" -lt 3 ]; then
echo "Not enough tools!"
exit 1
fi
```
## Python Scripting
```python
import subprocess
import json
def get_tools(image):
result = subprocess.run(
["hud", "analyze", image, "--format", "json"],
capture_output=True,
text=True,
check=True
)
return json.loads(result.stdout)["tools"]
# Use
tools = get_tools("my-env:latest")
for tool in tools:
print(f"- {tool['name']}: {tool['description']}")
```
## Exit Codes
| Code | Meaning | Description |
| ---- | ---------------- | ------------------- |
| 0 | Success | Command completed |
| 1 | General Error | Command failed |
| 2 | Usage Error | Invalid arguments |
| 3 | Connection Error | Failed to connect |
| 4 | Timeout | Operation timed out |
| 5 | Protocol Error | MCP violation |
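These exit codes make the CLI straightforward to script against. A minimal sketch, assuming the `hud` binary is on `PATH` (the image name is illustrative):
```python
import subprocess

# Run a compliance check and branch on the documented exit codes
result = subprocess.run(["hud", "debug", "my-env:latest"])

if result.returncode == 0:
    print("Environment passed all checks")
elif result.returncode in (3, 4):
    print("Connection issue or timeout - consider retrying")
else:
    print(f"Debug failed with exit code {result.returncode}")
```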
## Environment Variables
```bash
# Debug output
export HUD_CLI_DEBUG=true
# Custom timeout
export HUD_CLI_TIMEOUT=120
# Provider keys
export ANCHOR_API_KEY=...
```
## Next Steps
### Building Commands
* [`hud init`](/reference/cli/init) - Create new environments from scratch
* [`hud dev`](/reference/cli/dev) - Develop with hot-reload and interactive testing
* [`hud build`](/reference/cli/build) - Build images and generate lock files
* [`hud push`](/reference/cli/push) - Share environments to registry
### Running Commands
* [`hud pull`](/reference/cli/pull) - Get environments with metadata preview
* [`hud analyze`](/reference/cli/analyze) - Inspect tools and capabilities
* [`hud debug`](/reference/cli/debug) - Test MCP protocol compliance
* [`hud run`](/reference/cli/run) - Execute servers locally or remotely
# hud pull
Source: https://docs.hud.so/reference/cli/pull
Get HUD environments from registry with metadata preview
The `hud pull` command fetches HUD environments from registries, showing a preview of tools and capabilities before downloading the Docker image.
## Usage
```bash
hud pull TARGET [OPTIONS]
```
## Arguments
* `TARGET` - Image reference (e.g., `org/env:tag`) or path to a lock file
## Options
* `--lock`, `-l` - Path to save the lock file (if target is an image reference)
* `--yes`, `-y` - Skip confirmation prompt
* `--verify-only` - Only verify metadata without pulling the image
* `--verbose`, `-v` - Show detailed output
## How It Works
1. Retrieves environment information from:
   * HUD registry (if available)
   * Docker manifest (fallback)
   * Local lock file (if provided)
2. Shows environment details:
   * Available tools and descriptions
   * Build information
   * Required environment variables
   * Image size
3. Prompts for confirmation before pulling (unless `--yes`)
4. Downloads the Docker image using `docker pull`
5. Stores metadata locally for offline access
## Examples
### Basic Pull
Pull from HUD registry:
```bash
hud pull hudpython/text-analyzer:latest
```
Output:
```
🚀 HUD Environment Pull
Fetching metadata for: hudpython/text-analyzer:latest
📦 Environment Details
┌─────────────┬──────────────────────────┐
│ Image │ hudpython/text-analyzer │
│ Source │ HUD Registry │
│ Built │ 2024-01-15T10:30:00Z │
│ HUD Version │ 0.1.0 │
│ Size │ 156 MB │
│ Tools │ 8 │
└─────────────┴──────────────────────────┘
🔧 Available Tools
├── setup - Initialize environment
├── evaluate - Return environment state
├── analyze_text - Analyze text sentiment
├── extract_entities - Extract named entities
└── ... 4 more tools
🔑 Environment Variables
├── Required: OPENAI_API_KEY
└── Optional: DEBUG, LOG_LEVEL
Pull hudpython/text-analyzer:latest? [y/N]: y
```
### Skip Confirmation
Auto-pull for scripts:
```bash
hud pull myorg/myenv:v1.0.0 --yes
```
### Verify Only
Check metadata without downloading:
```bash
hud pull myorg/myenv:latest --verify-only
```
### From Lock File
Pull using a shared lock file:
```bash
# Someone shares their hud.lock.yaml
hud pull ./shared-env.lock.yaml
```
### Custom Lock Location
Save lock file to specific path:
```bash
hud pull myorg/myenv:latest --lock ~/.hud/locks/myenv.yaml
```
## Metadata Sources
The command tries these sources in order:
1. **HUD Registry** (Fastest)
* Full tool descriptions
* Build metadata
* Environment requirements
2. **Docker Registry** (Limited)
* Basic image info
* Size and layers
* No tool details
3. **Lock File** (Complete)
* Full environment snapshot
* Reproducible reference
* Offline capable
## Lock File Storage
Downloaded lock files are saved to:
```
~/.hud/locks/{org}_{name}_{tag}.yaml
```
Examples:
* `hudpython_text-analyzer_latest.yaml`
* `myorg_browser-env_v1.0.0.yaml`
Lock files are cached locally, so repeated `hud analyze` commands are instant.
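If you want to inspect the cache programmatically, a rough sketch (assuming PyYAML is installed and each lock file carries a top-level `image` field, as in the lock file examples under `hud push`):
```python
from pathlib import Path

import yaml  # assumes PyYAML is available

locks_dir = Path.home() / ".hud" / "locks"

# List cached environments and the image each lock file points at
for lock_file in sorted(locks_dir.glob("*.yaml")):
    data = yaml.safe_load(lock_file.read_text())
    print(f"{lock_file.name}: {data.get('image', '<unknown image>')}")
```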
## Next Steps
After pulling, the command shows:
```
✅ Image pulled successfully
✅ Lock file saved to: ~/.hud/locks/hudpython_text-analyzer_latest.yaml
Next Steps:
1. Quick analysis: hud analyze hudpython/text-analyzer
2. Live analysis: hud analyze hudpython/text-analyzer --live
3. Run locally: hud run hudpython/text-analyzer:latest
```
## Registry Access
### Public Environments
No authentication needed:
```bash
# HUD's public environments
hud pull hudpython/code-analyzer:latest
hud pull hudpython/web-browser:latest
```
### Private Registries
Requires Docker login:
```bash
# Login first
docker login private-registry.com
# Then pull
hud pull private-registry.com/team/tool:latest
```
### API Key Benefits
With `HUD_API_KEY` set:
* Access private HUD metadata
* Faster metadata retrieval
* Additional environment details
```bash
export HUD_API_KEY="your-key"
hud pull myorg/private-env:latest
```
## Troubleshooting
### Common Issues
The environment may not have been pushed to the HUD registry:
* Metadata will be limited to Docker manifest info
* Ask the author to run `hud push`
* Use `--verbose` to see fallback behavior
Check Docker authentication:
```bash
# Verify login
docker login
# Test manual pull
docker pull myorg/myenv:latest
```
HUD registry may be slow or unavailable:
```bash
# Skip HUD registry, use Docker only
hud pull myorg/myenv:latest --verbose
```
### Network Issues
Behind a proxy? Configure Docker:
```bash
export HTTP_PROXY=http://proxy:8080
export HTTPS_PROXY=http://proxy:8080
```
## Integration Examples
### CI/CD Pipeline
```yaml
# GitHub Actions
- name: Pull HUD environment
run: |
hud pull myorg/test-env:latest --yes
hud analyze myorg/test-env:latest
```
### Docker Compose
```yaml
services:
mcp-env:
image: myorg/myenv:latest
environment:
- API_KEY=${API_KEY}
```
### Scripted Setup
```bash
#!/bin/bash
# Pull multiple environments
ENVS=("web-browser" "code-analyzer" "data-processor")
for env in "${ENVS[@]}"; do
echo "Pulling $env..."
hud pull "hudpython/$env:latest" --yes
done
```
## Best Practices
1. **Version Pinning**: Use specific tags
```bash
hud pull myorg/tool:v1.2.3 # Good
hud pull myorg/tool:latest # Risky
```
2. **Cache Lock Files**: Share lock files for reproducibility
```bash
# Save to project
hud pull myorg/env:v1 --lock ./env.lock.yaml
git add env.lock.yaml
```
3. **Verify First**: Check before downloading large images
```bash
hud pull myorg/large-env:latest --verify-only
```
4. **Document Requirements**: Note any required env vars
```bash
# After pull, check requirements
hud analyze myorg/env:latest
```
## See Also
* [`hud push`](/reference/cli/push) - Share environments
* [`hud analyze`](/reference/cli/analyze) - Inspect pulled environments
* [`hud run`](/reference/cli/run) - Execute pulled environments
* [`hud debug`](/reference/cli/debug) - Test pulled environments
* [Build Environments](/build-environments/registry) - Registry guide
# hud push
Source: https://docs.hud.so/reference/cli/push
Share HUD environments to Docker and HUD registries
The `hud push` command publishes your HUD environment to both Docker registry and HUD registry, making it available for others to use with `hud pull`.
## Usage
```bash
hud push [DIRECTORY] [OPTIONS]
```
## Arguments
* `DIRECTORY` - Environment directory containing `hud.lock.yaml`
## Options
* `--image`, `-i` - Override registry image name (e.g., `myorg/myenv`)
* `--tag`, `-t` - Override tag (e.g., `v1.0`, `latest`)
* `--sign` - Sign the image with cosign (not yet implemented)
* `--yes`, `-y` - Skip confirmation prompts
* `--verbose`, `-v` - Show detailed output
## Prerequisites
`hud push` requires an API key. Set it via environment variable:
```bash
export HUD_API_KEY="your-api-key"
```
You must be logged in to a Docker registry:
```bash
docker login
# or
docker login ghcr.io
```
## What It Does
1. Reads `hud.lock.yaml` to get the image reference and metadata
2. Creates a registry tag from the local image:
   * Uses `--image` if provided
   * Auto-detects from Docker config
   * Defaults to Docker Hub format
3. Pushes the image to the Docker registry and records the digest
4. Updates `hud.lock.yaml` with:
   * Registry image reference with digest
   * Push timestamp
   * Registry information
5. Sends the lock file to the HUD registry at `{HUD_API_URL}/registry/envs/{org}/{name}:{tag}`
## Registry Detection
The command intelligently detects your registry:
1. **Explicit**: Use `--image` flag
```bash
hud push . --image ghcr.io/myorg/myenv
```
2. **Docker Config**: Reads from `~/.docker/config.json`
```bash
docker login ghcr.io
hud push # Auto-uses ghcr.io
```
3. **Default**: Docker Hub format
```bash
hud push # Creates: yourusername/envname
```
## Examples
### Basic Push
Push to default registry:
```bash
# After hud build
hud push
```
Output:
```
🚀 HUD Environment Push
📋 Local Details
Image: hud-my-env:latest
Lock: hud.lock.yaml (2.3 KB)
Tools: 5 tools available
🎯 Registry Target
Local: hud-my-env:latest
Registry: myusername/my-env:latest
Push to registry? [y/N]: y
✅ Tagged image
✅ Pushed to registry: myusername/my-env@sha256:abc123...
✅ Updated lock file
✅ Metadata uploaded to HUD registry
```
### Custom Registry
Push to GitHub Container Registry:
```bash
hud push . --image ghcr.io/myorg/my-tool --tag v1.0.0
```
### CI/CD Automation
Skip prompts for automation:
```bash
hud push . --yes --tag "prod-$(date +%Y%m%d)"
```
### Override Tag
Push with different tag:
```bash
hud push . --tag experimental
```
## Lock File Updates
After pushing, `hud.lock.yaml` gains registry information:
```yaml
version: "1.0"
image: "myorg/myenv:latest@sha256:abc123..."
build:
generatedAt: "2024-01-15T10:30:00Z"
hudVersion: "0.1.0"
directory: "my-env"
push:
source: "hud-my-env:latest"
pushedAt: "2024-01-15T11:00:00Z"
registry: "ghcr.io"
environment:
# ... environment details
tools:
# ... tool information
```
## Sharing Workflows
### Public Sharing
After pushing, others can use:
```bash
# Pull with metadata preview
hud pull myorg/myenv:latest
# Use in MCP configs
{"hud": {"registry": "myorg/myenv:latest"}}
# Direct Docker usage
docker run myorg/myenv:latest
```
### Private Registries
For private registries, users need:
1. Docker registry access:
```bash
docker login private-registry.com
```
2. HUD API key (if using private HUD registry):
```bash
export HUD_API_KEY="shared-key"
```
## Usage Messages
After successful push, helpful usage examples are shown:
```
📦 Usage:
Test locally:
hud run myorg/myenv:latest
Use in MCP configs:
Claude Desktop:
{"docker": {"image": "myorg/myenv@sha256:...", "command": "auto"}}
Via HUD (recommended):
{"hud": {"registry": "myorg/myenv:latest"}}
Share environment:
Others can now pull with: hud pull myorg/myenv:latest
```
## Troubleshooting
### Authentication Issues
Not logged in to a Docker registry:
```bash
# Docker Hub
docker login
# GitHub Registry
docker login ghcr.io -u USERNAME
# Custom registry
docker login my-registry.com
```
Missing or invalid `HUD_API_KEY`:
```bash
# Set API key
export HUD_API_KEY="your-key-here"
# Or create .env file
echo "HUD_API_KEY=your-key-here" > .env
```
Ensure you have push access to the registry namespace:
* Docker Hub: Must own the namespace
* GitHub: Must have package write permissions
* Private: Check with your admin
### Push Failures
For large images, push may timeout. Use `--verbose` to see progress:
```bash
hud push . --verbose
```
If metadata upload fails, the Docker image is still pushed. You can retry metadata upload or share `hud.lock.yaml` manually.
## Best Practices
1. **Version Tags**: Use semantic versioning
```bash
hud push . --tag v1.0.0
hud push . --tag v1.1.0-beta
```
2. **Namespace Organization**: Group related environments
```bash
hud push . --image myorg/tools/browser
hud push . --image myorg/tools/scraper
```
3. **Documentation**: Update README with pull instructions
````markdown
## Installation
```bash
hud pull myorg/myenv:latest
```
````
4. **CI Integration**: Automate pushes
```yaml
- name: Push HUD environment
env:
HUD_API_KEY: ${{ secrets.HUD_API_KEY }}
run: hud push . --yes --tag ${{ github.ref_name }}
```
## Security
The HUD registry only stores metadata (lock files), not code or images. Actual images are stored in Docker registries with their existing security models.
* **API Keys**: Keep `HUD_API_KEY` secret
* **Registry Auth**: Use Docker's credential helpers
* **Image Signing**: Coming soon with `--sign` flag
* **Private Envs**: Use private Docker registries
## See Also
* [`hud build`](/reference/cli/build) - Build before pushing
* [`hud pull`](/reference/cli/pull) - Get shared environments
* [`hud analyze`](/reference/cli/analyze) - Inspect pushed environments
* [Build Environments](/build-environments/registry) - Registry guide
# hud run
Source: https://docs.hud.so/reference/cli/run
Execute MCP environments locally or remotely for production use
The `hud run` command executes MCP environments either locally via Docker or remotely via the HUD platform. Unlike `hud dev` which is for development, `hud run` is optimized for production execution.
## Usage
```bash
hud run IMAGE [OPTIONS] [-- DOCKER_ARGS...]
```
## Arguments
* `IMAGE` - Docker image reference (e.g., `myorg/env:latest`)
* `DOCKER_ARGS` - Additional Docker arguments after `--` (e.g., `-e KEY=value`)
Always use `--` before Docker arguments to avoid conflicts with HUD options:
* ✅ `hud run image:tag -- -e KEY=value`
* ❌ `hud run image:tag -e KEY=value`
## Options
* `--local` - Run locally with Docker instead of remote execution
* `--remote` - Explicitly run remotely (default behavior)
* `--transport`, `-t` - Transport protocol: `stdio` or `http`
* `--port`, `-p` - Port for HTTP transport (ignored for stdio)
* `--url` - Remote MCP server URL
* `--api-key` - API key for remote server (defaults to `HUD_API_KEY` env var)
* `--run-id` - Run ID for tracking (remote only)
* `--verbose`, `-v` - Show detailed output
## Remote Execution (Default)
By default, `hud run` executes your Docker image on the HUD platform's remote infrastructure. This provides:
* **Scalability**: Run hundreds of concurrent instances
* **Resource Management**: Automatic container lifecycle management
* **Monitoring**: Live telemetry and performance metrics
* **Geographic Distribution**: Global edge deployment
### Remote Examples
```bash
# Basic remote execution
hud run hudpython/text-2048:latest
# With environment variables
hud run myorg/server:v1 -- -e API_KEY=secret -e DEBUG=true
# HTTP transport with custom port
hud run myorg/server:v1 --transport http --port 9000
# Custom remote URL and API key
hud run my-image:latest --url https://custom.mcp.server --api-key my-key
# With run tracking
hud run production-env:v1.0 --run-id "deploy-$(date +%Y%m%d)"
```
### Authentication
Remote execution requires authentication via API key:
```bash
# Set environment variable (recommended)
export HUD_API_KEY=your-api-key-here
hud run my-image:latest
# Or pass directly (less secure)
hud run my-image:latest --api-key your-api-key-here
```
### Run Tracking
Use `--run-id` to track specific execution instances:
```bash
hud run my-image:latest --run-id "experiment-001"
```
## Local Execution
Use `--local` to run the image locally with Docker:
```bash
# Basic local execution
hud run hudpython/text-2048:latest --local
# With Docker arguments
hud run myorg/server:v1 --local -- -e API_KEY=secret -v /data:/app/data
# HTTP transport locally
hud run myorg/server:v1 --local --transport http --port 8080
# With memory limits and volumes
hud run my-env:latest --local -- --memory 2g -v $(pwd)/data:/app/data
```
Local execution is useful for:
* Development and testing
* Environments without internet access
* Custom Docker configurations
* Debugging before remote deployment
## Transport Modes
### Stdio Transport (Default)
```bash
hud run my-image:latest --transport stdio
```
**Characteristics:**
* Direct JSON-RPC over stdin/stdout
* Single client connection
* Lower latency
* MCP protocol native
### HTTP Transport
```bash
hud run my-image:latest --transport http --port 8765
```
**Characteristics:**
* HTTP proxy at specified port
* Multiple concurrent connections
* Web-compatible
* Inspector support
## Docker Arguments
Pass Docker arguments after `--` separator:
```bash
# Environment variables
hud run my-image:latest -- -e DEBUG=true -e API_KEY=secret
# Volume mounts
hud run my-image:latest -- -v /host/data:/container/data
# Network settings
hud run my-image:latest -- --network host
# Memory limits
hud run my-image:latest -- --memory 2g
# Combined arguments
hud run my-image:latest -- -e API_KEY=secret -v /data:/app/data --memory 1g
```
The `--` separator is required to distinguish HUD options from Docker arguments.
## Environment Variables
| Variable | Description | Default |
| ------------- | --------------------------------- | --------------------------- |
| `HUD_MCP_URL` | Remote MCP server URL | `https://mcp.hud.so/v3/mcp` |
| `HUD_API_KEY` | API key for remote authentication | *required for remote* |
## Comparison with Other Commands
| Command | Purpose | Hot-reload | Target Use Case |
| ----------- | ----------------------------- | ---------- | ----------------------------- |
| `hud dev` | Development with auto-restart | ✅ Yes | Local development iteration |
| `hud run` | Production execution | ❌ No | Production workloads, testing |
| `hud debug` | Environment validation | ❌ No | Debugging and validation |
## Examples
### Basic Usage
```bash
# Run remotely (default)
hud run hudpython/text-2048:latest
# Run locally for testing
hud run hudpython/text-2048:latest --local
# Verbose output for debugging
hud run my-image:latest --local --verbose
```
### Production Deployment
```bash
# Remote execution with tracking
export HUD_API_KEY=your-production-key
hud run production-env:v1.0 --run-id "prod-deploy-$(date +%s)"
# With environment configuration
hud run production-env:v1.0 --run-id "prod-$(git rev-parse --short HEAD)" -- \
-e DATABASE_URL=postgres://... \
-e REDIS_URL=redis://... \
-e LOG_LEVEL=info
```
### Development Testing
```bash
# Test locally before remote deployment
hud run my-env:dev --local -- -e DEBUG=true
# Test with HTTP transport
hud run my-env:dev --local --transport http --port 9000
# Test with custom data volume
hud run my-env:dev --local -- -v $(pwd)/test-data:/app/data
```
### CI/CD Integration
```yaml
# GitHub Actions example
- name: Run HUD environment
env:
HUD_API_KEY: ${{ secrets.HUD_API_KEY }}
run: |
# Pull latest
hud pull myorg/env:latest --yes
# Run tests
hud run myorg/env:latest --run-id "ci-${{ github.sha }}"
```
## Error Handling
`hud run` provides clear error messages for common issues:
* **Missing image**: Suggests available images or build commands
* **Authentication failure**: Guides through API key setup
* **Port conflicts**: Suggests alternative ports
* **Docker errors**: Shows Docker-specific troubleshooting
## See Also
* [`hud pull`](/reference/cli/pull) - Get environments before running
* [`hud analyze`](/reference/cli/analyze) - Inspect capabilities first
* [`hud debug`](/reference/cli/debug) - Test environment compliance
* [`hud dev`](/reference/cli/dev) - Development with hot-reload
* [Build Environments](/build-environments) - Environment development guide
# Environments
Source: https://docs.hud.so/reference/environments
SDK reference for building MCP environments
The HUD SDK provides `MCPServer` for building MCP-compatible environments that work with any MCP client.
## MCPServer
```python
from hud.server import MCPServer
```
Enhanced FastMCP server with Docker-friendly features for building HUD environments.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------------ | ----- | ------------------------------- | -------- |
| `name` | `str` | Server name for MCP handshake | Required |
| `instructions` | `str` | Server instructions/description | `None` |
| `**fastmcp_kwargs` | `Any` | Additional FastMCP parameters | - |
**Key Features:**
1. **SIGTERM handling** - Graceful shutdown in containers via custom runner
2. **Initialize decorator** - Async setup during MCP initialize request
3. **Shutdown decorator** - Cleanup after lifespan context ends
4. **Enhanced add\_tool()** - Automatically handles BaseTool instances
5. **FastMCP inheritance** - All FastMCP methods available (`mount`, `resource`, `tool`)
### Decorators
#### @initialize
Run async setup during MCP initialize request:
```python
mcp = MCPServer(name="my-env")
@mcp.initialize
async def setup_environment(ctx):
"""
Initialize environment resources.
Args:
ctx: RequestContext with:
- ctx.meta: Client metadata dict
- ctx.session: MCP ServerSession
"""
# Access metadata from agent (if provided)
if ctx.meta:
progress_token = ctx.meta.get("progressToken")
display_width = ctx.meta.get("display_width", 1920)
display_height = ctx.meta.get("display_height", 1080)
# Send progress notifications
if progress_token:
await ctx.session.send_progress_notification(
progress_token=progress_token,
progress=50,
total=100,
message="Initializing environment..."
)
```
#### @shutdown
Run cleanup after server lifespan ends:
```python
@mcp.shutdown
async def cleanup():
"""Clean up resources on shutdown."""
if browser_provider:
browser_provider.close()
logger.info("Cleanup complete")
```
### Tool Registration
Three ways to register tools:
```python
# 1. Decorator for simple functions
@mcp.tool()
async def my_tool(param: str) -> dict:
return {"result": param}
# 2. Add BaseTool instances
from hud.tools import BashTool
bash = BashTool()
mcp.add_tool(bash) # Automatically uses bash.mcp internally
# 3. Add non-BaseTool instances directly
from custom import PlaywrightTool
playwright = PlaywrightTool()
mcp.add_tool(playwright) # Added as-is
```
### Hub Pattern (mount)
Use BaseHub for organized tool namespaces:
```python
from hud.tools import BaseHub
from mcp.types import TextContent
# Create hub
setup_hub = BaseHub("setup")
# Add internal tools (hidden from agents)
@setup_hub.tool("board")
async def setup_board(size: int = 4):
game = setup_hub.env
game.reset(size=size)
return [TextContent(text=f"{size}x{size} board initialized")]
# Mount hub on server
mcp.mount(setup_hub)
# Agents call via dispatcher: setup(name="board", arguments={"size": 4})
```
### Resources
Expose metadata via MCP resources:
```python
@mcp.resource("telemetry://live")
async def get_telemetry():
"""Expose live telemetry data."""
return {
"provider": os.getenv("BROWSER_PROVIDER"),
"status": "running" if browser_provider else "stopped",
"live_url": browser_provider.get_live_view_url() if browser_provider else None,
"timestamp": datetime.now().isoformat()
}
```
### Running the Server
```python
if __name__ == "__main__":
# Run with SIGTERM handling (stdio by default)
mcp.run()
# Or specify transport
mcp.run(transport="sse", port=8080)
```
## Real Environment Examples
### Minimal Environment
```python
# src/hud_controller/server.py
from hud.server import MCPServer
from mcp.types import TextContent
mcp = MCPServer(name="counter-env")
counter = {"value": 0}
@mcp.tool()
async def setup(start_value: int = 0):
"""Initialize counter."""
counter["value"] = start_value
return {"status": "ready", "counter": counter["value"]}
@mcp.tool()
async def increment():
"""Increment counter."""
counter["value"] += 1
return [TextContent(text=f"Counter: {counter['value']}", type="text")]
@mcp.tool()
async def evaluate(target: int):
"""Check if target reached."""
from hud.tools.types import EvaluationResult
return EvaluationResult(
reward=1.0 if counter["value"] >= target else 0.0,
done=counter["value"] >= target
)
if __name__ == "__main__":
mcp.run()
```
### text\_2048 Environment
From `environments/text_2048/src/hud_controller/server.py`:
```python
from hud.server import MCPServer
from .game import Game2048
from .tools import MoveTool
from .setup import setup as setup_hub
from .evaluate import evaluate as evaluate_hub
mcp = MCPServer(name="text-2048")
game = None
@mcp.initialize
async def initialize_environment(ctx):
global game
# Progress notifications
progress_token = getattr(ctx.meta, "progressToken", None) if ctx.meta else None
async def send_progress(progress: int, message: str):
if progress_token:
await ctx.session.send_progress_notification(
progress_token=progress_token,
progress=progress,
total=100,
message=message
)
await send_progress(0, "Starting 2048 game environment...")
# Create game
game = Game2048()
game.reset()
await send_progress(50, "Setting up game board...")
# Set game on hubs
setup_hub.env = game
evaluate_hub.env = game
# Mount hubs
mcp.mount(setup_hub)
mcp.mount(evaluate_hub)
await send_progress(70, "Configuring tools...")
# Add move tool
mcp.add_tool(MoveTool(env=game))
await send_progress(100, "2048 environment ready")
```
### remote\_browser Environment
From `environments/remote_browser/src/hud_controller/server.py`:
```python
from hud.server import MCPServer
from hud.tools.computer import HudComputerTool, AnthropicComputerTool, OpenAIComputerTool
from .tools import PlaywrightToolWithMemory, BrowserExecutor
from .setup import setup as setup_hub
from .evaluate import evaluate as evaluate_hub
from .providers import get_provider
mcp = MCPServer(
name="HUD Remote Browser Environment",
instructions="""Remote browser automation environment..."""
)
# Global state
browser_provider = None
playwright_tool = None
@mcp.resource("telemetry://live")
async def get_telemetry_resource():
"""MCP resource with live browser status."""
return {
"provider": os.getenv("BROWSER_PROVIDER", "unknown"),
"status": "running" if browser_provider else "stopped",
"live_url": browser_provider.get_live_view_url() if browser_provider else None,
"cdp_url": browser_provider.cdp_url if browser_provider else None
}
@mcp.initialize
async def initialize_environment(ctx):
global browser_provider, playwright_tool
# Get metadata
metadata = ctx.meta
progress_token = metadata.get("progressToken", None)
# Initialize provider
provider_name = os.getenv("BROWSER_PROVIDER")
provider_class = get_provider(provider_name)
browser_provider = provider_class(config)
# Launch browser
cdp_url = await browser_provider.launch()
# Create playwright tool
playwright_tool = PlaywrightToolWithMemory(cdp_url=cdp_url)
await playwright_tool._ensure_browser()
# Add playwright tool (not a BaseTool, added directly)
mcp.add_tool(playwright_tool)
# Create computer tools
executor = BrowserExecutor(playwright_tool)
tool_kwargs = {"executor": executor}
# Add display dimensions from metadata
if metadata:
width = metadata.get("display_width")
height = metadata.get("display_height")
if width and height:
tool_kwargs["width"] = width
tool_kwargs["height"] = height
# Add computer tools (all are BaseTool subclasses)
mcp.add_tool(HudComputerTool(**tool_kwargs))
mcp.add_tool(AnthropicComputerTool(**tool_kwargs))
mcp.add_tool(OpenAIComputerTool(**tool_kwargs))
# Mount hubs
setup_hub.env = playwright_tool
evaluate_hub.env = playwright_tool
mcp.mount(setup_hub)
mcp.mount(evaluate_hub)
@mcp.shutdown
async def shutdown_environment():
"""Cleanup browser resources."""
global browser_provider
if browser_provider:
browser_provider.close()
browser_provider = None
```
## Standard Structure
### Directory Layout
```
my-environment/
├── Dockerfile
├── pyproject.toml
├── README.md
└── src/
└── hud_controller/ # Standard package name
├── __init__.py
├── server.py # Entry point with MCPServer
├── context.py # Environment state (optional)
├── setup/ # Setup hub (optional)
│ ├── __init__.py # Creates BaseHub("setup")
│ └── *.py # Setup functions
├── evaluate/ # Evaluate hub (optional)
│ ├── __init__.py # Creates BaseHub("evaluate")
│ └── *.py # Evaluator functions
└── tools/ # Custom tools (optional)
└── *.py # Tool implementations
```
### Dockerfile
```dockerfile
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY pyproject.toml ./
RUN pip install -e .
# Copy source
COPY src/ ./src/
# Critical: stdio transport for MCP
CMD ["python", "-m", "hud_controller.server"]
```
### Hub Module Pattern
Example from text\_2048:
```python
# src/hud_controller/setup/__init__.py
from hud.tools.base import BaseHub
setup = BaseHub("setup")
# Import all setup functions to register them
from . import board
__all__ = ["setup"]
# src/hud_controller/setup/board.py
from . import setup
@setup.tool("board")
async def setup_board(board_size: int = 4):
"""Initialize game board."""
game = setup.env # Access environment from hub
game.reset(size=board_size)
return [TextContent(text=f"{board_size}x{board_size} game initialized")]
```
## Key Concepts
### Environment State
Three patterns for managing state:
1. **Global variables** (simple environments):
```python
game = None
@mcp.initialize
async def initialize_environment(ctx):
global game
game = Game2048()
```
2. **Context class** (complex environments):
```python
class EnvironmentContext:
def __init__(self):
self.browser = None
self.page = None
env = EnvironmentContext()
```
3. **Hub env attribute** (for tool access):
```python
setup_hub.env = game # Tools access via hub.env
```
### Tool Lifecycle
1. **Setup tools** - Hidden from agents, prepare environment state
2. **Interaction tools** - Available to agents for control
3. **Evaluate tools** - Hidden from agents, score performance
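A minimal sketch showing all three categories in one server, following the hub pattern above (the counter state and tool names are illustrative):
```python
from hud.server import MCPServer
from hud.tools import BaseHub
from hud.tools.types import EvaluationResult
from mcp.types import TextContent

mcp = MCPServer(name="lifecycle-demo")
setup_hub = BaseHub("setup")
evaluate_hub = BaseHub("evaluate")
counter = {"value": 0}

# 1. Setup tool - hidden behind the setup hub, prepares environment state
@setup_hub.tool("reset")
async def reset(start: int = 0):
    counter["value"] = start
    return [TextContent(text=f"Counter reset to {start}", type="text")]

# 2. Interaction tool - visible to agents for control
@mcp.tool()
async def increment() -> list[TextContent]:
    counter["value"] += 1
    return [TextContent(text=f"Counter: {counter['value']}", type="text")]

# 3. Evaluate tool - hidden behind the evaluate hub, scores performance
@evaluate_hub.tool("target_reached")
async def target_reached(target: int) -> EvaluationResult:
    reached = counter["value"] >= target
    return EvaluationResult(reward=1.0 if reached else 0.0, done=reached)

mcp.mount(setup_hub)
mcp.mount(evaluate_hub)
```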
### Progress Notifications
Send [progress updates](https://modelcontextprotocol.io/specification/basic/utilities/progress) during long-running operations:
```python
async def send_progress(progress: int, message: str):
if progress_token:
await ctx.session.send_progress_notification(
progress_token=progress_token,
progress=progress,
total=100,
message=message
)
```
Progress notifications follow the [MCP progress specification](https://modelcontextprotocol.io/specification/basic/utilities/progress#progress-flow). The `progressToken` comes from the client's request [metadata](https://modelcontextprotocol.io/specification/basic/index#_meta).
### Metadata Access
Agent metadata flows through initialization:
```python
@mcp.initialize
async def initialize_environment(ctx):
# From agent's metadata class variable
width = ctx.meta.get("display_width", 1920) if ctx.meta else 1920
height = ctx.meta.get("display_height", 1080) if ctx.meta else 1080
```
## Testing
```bash
# CLI testing
hud debug my-env:latest
hud analyze my-env:latest
```
```python
# Python testing
async def test():
    from hud.clients import MCPClient
    client = MCPClient({
        "env": {"command": "docker", "args": ["run", "-i", "my-env"]}
    })
    async with client:
        tools = await client.list_tools()
        result = await client.call_tool("setup", {"value": 0})
```
## See Also
* [Build Environments](/build-environments) - Conceptual guide
* [Tools](/reference/tools) - Tool implementation reference
* [Advanced Design](/build-environments/registry) - Hub pattern details
# Tasks
Source: https://docs.hud.so/reference/tasks
SDK reference for task configuration and dataset utilities
The HUD SDK provides the `Task` class for defining agent objectives and dataset utilities for managing task collections.
## Task Class
```python
from hud.datasets import Task
```
Pydantic model that defines an agent's objective, setup, and evaluation criteria.
**Fields:**
| Field | Type | Description | Default |
| --------------- | ------------------------------------------ | ------------------------------------ | -------- |
| `id` | `str \| None` | Unique identifier (UUID recommended) | `None` |
| `prompt` | `str` | Task instruction for the agent | Required |
| `mcp_config` | `dict[str, Any]` | MCP server configuration | Required |
| `setup_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to prepare environment | `None` |
| `evaluate_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to score performance | `None` |
| `system_prompt` | `str \| None` | Additional system prompt | `None` |
| `metadata` | `dict[str, Any]` | Extra task metadata | `{}` |
### Environment Variable Substitution
The `mcp_config` field automatically resolves environment variables using `${VAR_NAME}` syntax:
```python
task = Task(
prompt="Navigate to the dashboard",
mcp_config={
"browser": {
"url": "${HUD_MCP_URL:https://mcp.hud.so/v3/mcp}",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-browser:latest"
}
}
}
)
```
**Substitution happens automatically when Task is created from a dict** - environment variables are resolved using Python's `Template.substitute()` with a defaultdict that returns empty strings for missing variables.
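For intuition, the resolution behaves roughly like the standalone sketch below (not the SDK's actual code, and it ignores the `${VAR:default}` default syntax shown above):
```python
import os
from collections import defaultdict
from string import Template

config_template = '{"Authorization": "Bearer ${HUD_API_KEY}"}'

# Missing variables resolve to empty strings instead of raising KeyError
env = defaultdict(str, os.environ)
print(Template(config_template).substitute(env))
```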
### Field Validators
Task automatically:
1. **Parses JSON strings** - `mcp_config` and `metadata` can be JSON strings
2. **Converts dicts to MCPToolCall** - `setup_tool` and `evaluate_tool` dicts are converted
3. **Resolves environment variables** - Only when created from dict (preserves templates in model\_dump())
```python
# All of these work:
task = Task(
prompt="Test",
mcp_config='{"server": {"url": "..."}}', # JSON string → dict
setup_tool='{"name": "setup"}' # JSON string → MCPToolCall
)
task = Task(
prompt="Test",
mcp_config={...},
setup_tool={"name": "setup", "arguments": {...}} # dict → MCPToolCall
)
task = Task(
prompt="Test",
mcp_config={...},
evaluate_tool=[ # List of tool calls
{"name": "check_text", "arguments": {"text": "Success"}},
{"name": "check_url", "arguments": {"url": "example.com"}}
]
)
```
## Dataset Functions
### run\_dataset
```python
from hud.datasets import run_dataset
results = await run_dataset(
name="My Evaluation",
dataset="hud-evals/SheetBench-50", # or Dataset object or list[dict]
agent_class=ClaudeAgent,
agent_config={"model": "claude-3-5-sonnet"},
max_concurrent=50,
max_steps=40
)
```
Run all tasks in a dataset with automatic parallelization and job tracking.
**Parameters:**
| Parameter | Type | Description | Default |
| ---------------------- | ------------------------------ | -------------------------------------------- | --------- |
| `name` | `str` | Job name for tracking | Required |
| `dataset` | `str \| Dataset \| list[dict]` | HF dataset ID, Dataset object, or task dicts | Required |
| `agent_class` | `type[MCPAgent]` | Agent class to instantiate | Required |
| `agent_config` | `dict[str, Any] \| None` | Constructor kwargs for agent | `None` |
| `max_concurrent` | `int` | Maximum parallel tasks | `50` |
| `metadata` | `dict[str, Any] \| None` | Job metadata | `None` |
| `max_steps` | `int` | Max steps per task | `40` |
| `split` | `str` | Dataset split when loading by ID | `"train"` |
| `auto_respond` | `bool` | Use ResponseAgent for continuations | `False` |
| `custom_system_prompt` | `str \| None` | Override dataset system prompt | `None` |
**Returns:** `list[Trace]` - Results in dataset order
**Features:**
* Automatic job tracking with `hud.job()`
* System prompt loading from dataset's `system_prompt.txt`
* Parallel execution with semaphore control
* Progress tracking with individual traces
* Graceful error handling (continues on failures)
**Example:**
```python
# From HuggingFace dataset ID
results = await run_dataset(
"SheetBench Evaluation",
"hud-evals/SheetBench-50",
ClaudeAgent,
{"model": "claude-3-5-sonnet-20241022"}
)
# From Dataset object
from datasets import load_dataset
dataset = load_dataset("hud-evals/OSWorld-Verified", split="train")
results = await run_dataset("OSWorld Test", dataset, OperatorAgent)
# From list of dicts
tasks = [
{
"prompt": "Click submit",
"mcp_config": {...},
"setup_tool": {"name": "setup"}
}
]
results = await run_dataset("Custom", tasks, ClaudeAgent)
```
### fetch\_system\_prompt\_from\_dataset
```python
from hud.datasets import fetch_system_prompt_from_dataset
system_prompt = await fetch_system_prompt_from_dataset("hud-evals/SheetBench-50")
```
Fetch `system_prompt.txt` from a HuggingFace dataset repository.
**Returns:** `str | None` - System prompt text if found
**Note:** Requires `huggingface_hub` to be installed.
### save\_tasks
```python
from hud.datasets import save_tasks
# IMPORTANT: Pass dicts, not Task objects!
task_dicts = [task.model_dump() for task in tasks]
save_tasks(
tasks=task_dicts,
repo_id="my-org/my-tasks",
private=False,
dataset_name="My Tasks v1.0"
)
```
Save tasks to HuggingFace dataset with JSON string serialization.
**Parameters:**
| Parameter | Type | Description | Default |
| ---------- | ---------------------- | ------------------------------------ | -------- |
| `tasks` | `list[dict[str, Any]]` | Task dictionaries (NOT Task objects) | Required |
| `repo_id` | `str` | HuggingFace repository ID | Required |
| `**kwargs` | `Any` | Additional args for `push_to_hub()` | - |
**Important:** Always pass dictionaries to preserve environment variable templates:
```python
# ✅ GOOD - Preserves ${HUD_API_KEY} in the dataset
task_dicts = [task.model_dump() for task in tasks]
save_tasks(task_dicts, "my-dataset")
# ❌ BAD - Would expose resolved API keys!
# save_tasks(tasks, "my-dataset") # Don't do this!
```
**Dataset Schema:**
The function converts complex fields to JSON strings for clean HuggingFace schema:
* `mcp_config` → JSON string
* `setup_tool` → JSON string (if present)
* `evaluate_tool` → JSON string (if present)
* `metadata` → JSON string (if present)
## MCPToolCall Type
```python
from hud.types import MCPToolCall
```
Defines a tool invocation with arguments.
**Fields:**
| Field | Type | Description | Default |
| ----------- | ---------------- | ----------------- | -------- |
| `name` | `str` | Tool name to call | Required |
| `arguments` | `dict[str, Any]` | Tool arguments | `{}` |
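For example, the hub dispatcher pattern used throughout these docs nests one call's name and arguments inside another's `arguments`:
```python
from hud.types import MCPToolCall

# Plain tool call
setup_call = MCPToolCall(name="setup")

# Hub dispatcher call: the outer "setup" tool routes to its inner "board" function
setup_board = MCPToolCall(
    name="setup",
    arguments={"name": "board", "arguments": {"board_size": 4}},
)
```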
## Real-World Examples
### Loading Tasks from Datasets
From `examples/run_evaluation.py`:
```python
# Load dataset
dataset = load_dataset("hud-evals/SheetBench-50", split="train")
# Get system prompt
dataset_system_prompt = await fetch_system_prompt_from_dataset("hud-evals/SheetBench-50")
# Run single task
sample_task = dataset[0]
task = Task(**sample_task) # Auto-converts dict to Task
# Or run all tasks
results = await run_dataset(
"Full SheetBench Evaluation",
dataset,
ClaudeAgent,
{"model": "claude-3-5-sonnet-20241022"},
max_concurrent=50
)
```
### Task Structure in Datasets
From `environments/text_2048/2048_taskconfigs.json`:
```json
{
"id": "2048_target_256",
"prompt": "You are playing 2048 and your goal is to reach the 256 tile.",
"mcp_config": {
"local": {
"command": "docker",
"args": ["run", "--rm", "-i", "hud-text-2048"]
}
},
"setup_tool": {
"name": "setup",
"arguments": {
"name": "board",
"arguments": {"board_size": 4}
}
},
"evaluate_tool": {
"name": "evaluate",
"arguments": {
"name": "max_number",
"arguments": {"target": 256}
}
},
"metadata": {
"difficulty": 256,
"game": "2048"
}
}
```
### Creating and Saving Tasks
```python
import os
from uuid import uuid4

from hud.datasets import Task, save_tasks
from hud.types import MCPToolCall
# Set environment variables
os.environ["HUD_API_KEY"] = "sk-hud-..."
os.environ["BROWSER_PROVIDER"] = "anchorbrowser"
# Create tasks with env var templates
tasks = []
for url in ["example.com", "test.com"]:
task = Task(
id=str(uuid4()),
prompt=f"Navigate to {url} and find the login button",
mcp_config={
"browser": {
"url": "${HUD_MCP_URL:https://mcp.hud.so/v3/mcp}",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"X-Provider": "${BROWSER_PROVIDER}"
}
}
},
setup_tool=MCPToolCall(
name="setup",
arguments={"name": "navigate", "arguments": {"url": url}}
),
evaluate_tool=MCPToolCall(
name="evaluate",
arguments={"name": "element_exists", "arguments": {"selector": "button.login"}}
),
metadata={
"category": "navigation",
"difficulty": "easy"
}
)
tasks.append(task)
# Save to HuggingFace (preserves env var templates)
task_dicts = [t.model_dump() for t in tasks]
save_tasks(task_dicts, "my-org/navigation-tasks")
```
### Agent Integration
Tasks automatically configure agents:
```python
task = Task(
prompt="Complete the form",
mcp_config={...},
setup_tool={...},
evaluate_tool={...},
system_prompt="Be very careful with form fields"
)
agent = ClaudeAgent()
# When agent.run(task) is called:
# 1. If no mcp_client, creates one from task.mcp_config
# 2. Adds task.system_prompt to agent's system prompt
# 3. Adds lifecycle tools to agent.lifecycle_tools
# 4. Runs setup_tool (hidden from agent)
# 5. Executes task.prompt
# 6. Runs evaluate_tool (hidden from agent)
# 7. Returns Trace with evaluation reward
```
## Best Practices
1. **Use UUIDs for task IDs** - Required for HuggingFace datasets
2. **Save dictionaries, not objects** - Preserves env var templates
3. **Include system\_prompt.txt** - Upload to dataset repo root
4. **Use metadata for filtering** - Category, difficulty, tags
5. **Test locally first** - Before uploading to HuggingFace
6. **Version your datasets** - Use meaningful repo names
## Common Patterns
### Filtering Tasks
```python
# Load all tasks
dataset = load_dataset("hud-evals/WebArena-Lite", split="train")
# Filter in Python
login_tasks = [
task for task in dataset
if "login" in task["prompt"].lower()
]
# Convert and run
results = await run_dataset(
"Login Tasks Only",
login_tasks,
ClaudeAgent
)
```
### Custom System Prompts
```python
# Override dataset system prompt
results = await run_dataset(
"Custom Evaluation",
"hud-evals/SheetBench-50",
ClaudeAgent,
custom_system_prompt="You are an expert spreadsheet user. Be precise."
)
```
### Environment Variable Management
```python
# Tasks preserve templates
task = Task(
prompt="Test",
mcp_config={
"server": {
"api_key": "${API_KEY:default-key}"
}
}
)
# model_dump() preserves the template
data = task.model_dump()
assert data["mcp_config"]["server"]["api_key"] == "${API_KEY:default-key}"
# But task object has resolved value
os.environ["API_KEY"] = "real-key"
task2 = Task(**data)
# task2.mcp_config["server"]["api_key"] is now "real-key"
```
## See Also
* [Task System](/core-concepts/task-system) - Conceptual overview
* [Create Benchmarks](/evaluate-agents/create-benchmarks) - Building datasets
* [Agents](/reference/agents) - How agents use tasks
# Tools
Source: https://docs.hud.so/reference/tools
SDK reference for HUD tools and executors
The HUD SDK provides a comprehensive set of tools for environment interaction. All tools inherit from `BaseTool` and return standardized MCP content blocks.
## Base Classes
### BaseTool
```python
from hud.tools import BaseTool
```
Abstract base class for all MCP tools. Provides standardized output formatting and MCP registration.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------- | ----- | --------------------------------------------------- | ------------------- |
| `env` | `Any` | Optional stateful context (game, executor, browser) | `None` |
| `name` | `str` | Tool name for MCP registration | Auto from class |
| `title` | `str` | Human-readable display name | Auto from class |
| `description` | `str` | Tool description | Auto from docstring |
**Abstract Methods:**
```python
async def __call__(self, **kwargs: Any) -> list[ContentBlock]
```
**Properties:**
* `mcp` - FastMCP FunctionTool wrapper for registration
**Usage:**
```python
from hud.tools import BaseTool
from mcp.types import ContentBlock, TextContent

class MyTool(BaseTool):
async def __call__(self, param: str) -> list[ContentBlock]:
result = await self.env.do_something(param)
return [TextContent(text=result, type="text")]
# MCPServer automatically handles BaseTool instances
tool = MyTool(env=my_context)
mcp_server.add_tool(tool) # No need for .mcp property
```
### BaseHub
```python
from hud.tools import BaseHub
```
A composition-friendly FastMCP server for organizing tools into nested namespaces.
**Key Features:**
* Internal tool dispatcher pattern
* Automatic resource catalog generation
* Hidden internal tools from MCP clients
**Usage:**
```python
hub = BaseHub("evaluators")
@hub.tool() # Internal tool, hidden from agents
async def url_match(url: str) -> EvaluationResult:
return EvaluationResult(reward=1.0, done=True)
# Access via dispatcher when mounted on server
# Agents call: evaluators(name="url_match", arguments={"url": "..."})
```
## Core Tools
### BashTool
```python
from hud.tools import BashTool
```
Execute bash commands in a persistent shell session.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------- | -------------- | --------------------------- | ------- |
| `session` | `_BashSession` | Pre-configured bash session | `None` |
**Tool Parameters:**
| Parameter | Type | Description | Default |
| --------- | ------ | ------------------------ | -------- |
| `command` | `str` | Bash command to execute | Required |
| `restart` | `bool` | Restart the bash session | `False` |
**Example:**
```python
bash = BashTool()
result = await bash(command="ls -la")
# Returns: [TextContent(text="file1.txt\nfile2.txt", type="text")]
# Restart session
await bash(restart=True)
```
### EditTool
```python
from hud.tools import EditTool
```
File system editor with undo support.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| -------------- | ----------------------- | --------------------- | ------- |
| `file_history` | `dict[Path, list[str]]` | Edit history per file | `None` |
**Commands:**
| Command | Parameters | Description |
| ------------- | -------------------------------- | ------------------------------- |
| `view` | `path`, `view_range` | View file or directory contents |
| `create` | `path`, `file_text` | Create new file |
| `str_replace` | `path`, `old_str`, `new_str` | Replace string in file |
| `insert` | `path`, `insert_line`, `new_str` | Insert text at line |
| `undo_edit` | `path` | Undo last edit |
**Example:**
```python
editor = EditTool()
# View file
await editor(command="view", path="/path/to/file.py", view_range=[1, 20])
# Create file
await editor(command="create", path="/new/file.py", file_text="print('hello')")
# Replace text
await editor(command="str_replace", path="/file.py",
old_str="old_text", new_str="new_text")
```
### PlaywrightTool
```python
from hud.tools import PlaywrightTool
```
Web automation using Playwright browser.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------- | ------ | ---------------------------- | ------- |
| `page` | `Page` | Existing Playwright page | `None` |
| `cdp_url` | `str` | Chrome DevTools Protocol URL | `None` |
**Actions:**
| Action | Parameters | Description |
| ------------------ | ---------------------------- | -------------------------- |
| `navigate` | `url`, `wait_for_load_state` | Navigate to URL |
| `screenshot` | - | Capture page screenshot |
| `click` | `selector` | Click element |
| `type` | `selector`, `text` | Type text in element |
| `get_page_info` | - | Get page title and URL |
| `wait_for_element` | `selector` | Wait for element to appear |
**Example:**
```python
tool = PlaywrightTool()
# Navigate
await tool(action="navigate", url="https://example.com")
# Click button
await tool(action="click", selector="button.submit")
# Type text
await tool(action="type", selector="input#search", text="query")
```
## Computer Control Tools
### HudComputerTool
```python
from hud.tools import HudComputerTool
```
Universal computer control with automatic scaling and executor selection.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ---------------- | -------------------------------- | ------------------- | ----------- |
| `executor` | `BaseExecutor` | Executor to use | Auto-detect |
| `platform_type` | `"auto"`, `"xdo"`, `"pyautogui"` | Executor type | `"auto"` |
| `display_num` | `int` | X display number | From env |
| `width` | `int` | Agent screen width | 1280 |
| `height` | `int` | Agent screen height | 720 |
| `rescale_images` | `bool` | Rescale screenshots | `True` |
**Actions:**
| Action | Parameters | Description |
| ------------ | ------------------------------------------ | --------------------- |
| `screenshot` | - | Capture screen |
| `click` | `x`, `y`, `button`, `pattern`, `hold_keys` | Mouse click |
| `write` | `text`, `enter_after`, `delay` | Type text |
| `press` | `keys` | Press key combination |
| `scroll` | `x`, `y`, `scroll_x`, `scroll_y` | Scroll |
| `drag` | `start_x`, `start_y`, `end_x`, `end_y` | Drag mouse |
**Example:**
```python
computer = HudComputerTool()
# Take screenshot
await computer(action="screenshot")
# Click at coordinates
await computer(action="click", x=100, y=200)
# Type text
await computer(action="write", text="Hello World", enter_after=True)
# Press hotkey
await computer(action="press", keys=["ctrl", "c"])
```
### AnthropicComputerTool
```python
from hud.tools import AnthropicComputerTool
```
Computer control optimized for Anthropic's Claude models.
**Features:**
* Pre-configured for 1280x720 resolution
* Optimized action names for Claude
* Built-in screenshot scaling
### OpenAIComputerTool
```python
from hud.tools import OpenAIComputerTool
```
Computer control optimized for OpenAI models.
**Features:**
* Pre-configured for 1920x1080 resolution
* Simplified action interface
* No automatic screenshot scaling
## Executors
Executors provide platform-specific implementations for computer control actions.
### BaseExecutor
```python
from hud.tools.executors import BaseExecutor
```
Abstract base providing simulation mode for all actions.
**Core Methods:**
* `click(x, y, button, pattern, hold_keys)` - Mouse click
* `write(text, enter_after, delay)` - Type text
* `press(keys)` - Press key combination
* `scroll(x, y, scroll_x, scroll_y)` - Scroll
* `drag(start_x, start_y, end_x, end_y)` - Drag mouse
* `screenshot()` - Capture screen
* `get_screen_size()` - Get display dimensions
### PyAutoGUIExecutor
```python
from hud.tools.executors import PyAutoGUIExecutor
```
Cross-platform executor using PyAutoGUI library.
**Features:**
* Works on Windows, macOS, Linux
* Real mouse/keyboard control
* Screenshot capture
* Automatic failsafe
**Example:**
```python
executor = PyAutoGUIExecutor()
computer = HudComputerTool(executor=executor)
```
### XDOExecutor
```python
from hud.tools.executors import XDOExecutor
```
Linux/X11 executor using xdotool.
**Features:**
* Native X11 integration
* Faster than PyAutoGUI on Linux
* Support for X display selection
* Window management capabilities
**Example:**
```python
executor = XDOExecutor(display_num=1)
computer = HudComputerTool(executor=executor)
```
## Common Types
### ContentBlock
MCP standard output format (from `mcp.types`):
```python
from mcp.types import TextContent, ImageContent
# Text output
TextContent(text="Operation complete", type="text")
# Image output
ImageContent(data="base64_data", mimeType="image/png", type="image")
```
### EvaluationResult
```python
from hud.tools.types import EvaluationResult
result = EvaluationResult(
reward=0.8, # Score 0-1
done=True, # Task complete
content="Details", # Optional text
info={"score": 80} # Metadata
)
```
### ContentResult
```python
from hud.tools.types import ContentResult
# Helper for building complex outputs
result = ContentResult(
output="Success message",
error="Error if any",
base64_image="screenshot_data",
system="System message"
)
# Convert to ContentBlocks
blocks = result.to_content_blocks()
```
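A small sketch of a tool body that bundles text and a screenshot with `ContentResult` (the `env` helpers here are hypothetical):
```python
from hud.tools.types import ContentResult
from mcp.types import ContentBlock

async def act_and_report(env) -> list[ContentBlock]:
    # Hypothetical environment helpers; substitute your own
    message = await env.perform_action()
    screenshot_b64 = await env.screenshot()

    result = ContentResult(output=message, base64_image=screenshot_b64)
    return result.to_content_blocks()  # ready to return from a tool's __call__
```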
## Integration Examples
### Adding Tools to Environment
```python
from hud.server import MCPServer
from hud.tools import BashTool, EditTool
mcp = MCPServer(name="my-env")
# MCPServer handles BaseTool instances automatically
bash = BashTool()
mcp.add_tool(bash) # Internally uses bash.mcp
editor = EditTool()
mcp.add_tool(editor) # Same here
```
### Custom Tool Implementation
```python
from hud.tools import BaseTool
from mcp.types import ContentBlock, TextContent
class DatabaseTool(BaseTool):
def __init__(self, db_connection):
super().__init__(
env=db_connection,
name="database",
title="Database Query Tool",
description="Execute SQL queries"
)
async def __call__(self, query: str) -> list[ContentBlock]:
try:
results = await self.env.execute(query)
return [TextContent(text=str(results), type="text")]
except Exception as e:
return [TextContent(text=f"Error: {e}", type="text")]
```
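Registering the custom tool works the same way as the built-in tools above; `db_connection` is assumed to be an async client exposing the `execute(query)` method used in the class:
```python
from hud.server import MCPServer

mcp = MCPServer(name="db-env")

database = DatabaseTool(db_connection)
mcp.add_tool(database)  # MCPServer wires up the BaseTool instance automatically
```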
### Hub Pattern for Evaluators
```python
from hud.tools import BaseHub
from hud.tools.types import EvaluationResult
evaluators = BaseHub("evaluate")
@evaluators.tool("text_contains")
async def check_text(text: str, target: str) -> EvaluationResult:
return EvaluationResult(
reward=1.0 if target in text else 0.0,
done=True,
content=f"Checking if '{target}' in text"
)
# Use in environment
@mcp.tool()
async def evaluate(name: str, **kwargs):
return await evaluators.call_tool(name, kwargs)
```
## See Also
* [Build Environments](/build-environments) - Using tools in environments
* [Architecture](/core-concepts/architecture) - How tools fit in HUD
* [Adapting Software](/build-environments/adapting-software) - Tool patterns
# Datasets
Source: https://docs.hud.so/train-agents/datasets
Create and manage training datasets
Training datasets are collections of HUD tasks stored on HuggingFace. Each task defines a prompt, MCP configuration, and evaluation criteria.
## Creating Datasets
```python
from hud.datasets import Task, save_tasks
from hud.types import MCPToolCall
# Create task variations
tasks = []
for i in range(100):
tasks.append(Task(
prompt=f"Reach the {2**(9+i%4)} tile in 2048",
mcp_config={"hudpython/hud-text-2048:v1.2": {}},
setup_tool=MCPToolCall(name="setup_board"),
evaluate_tool=MCPToolCall(
name="evaluate_max_tile",
arguments={"target": 2**(9+i%4)}
)
))
# Save to HuggingFace
save_tasks(tasks, "my-org/2048-training")
```
## Dataset Format
Tasks are stored as HuggingFace datasets with these fields:
```python
{
"id": "task-001",
"prompt": "Navigate to login page",
"mcp_config": '{"hudpython/hud-browser:latest": {}}', # JSON string
"setup_tool": '{"name": "clear_cookies"}', # JSON string
"evaluate_tool": '{"name": "check_logged_in"}', # JSON string
"metadata": '{"difficulty": "easy"}' # JSON string
}
```
## Loading Datasets
```python
from datasets import load_dataset
from hud.datasets import Task
# Load from HuggingFace
dataset = load_dataset("hud-evals/2048-taskset", split="train")
tasks = [Task(**task_dict) for task_dict in dataset]
# Filter tasks
easy_tasks = [t for t in tasks if t.metadata.get("difficulty") == "easy"]
# Use in training (Verifiers integration)
import verifiers as vf

env = vf.load_environment(
env_id="hud-vf-gym",
taskset="hud-evals/2048-taskset",
config_path="./configs/2048.yaml",
num_tasks=50 # Sample 50 tasks
)
```
## Curriculum Learning
Create staged datasets for progressive training:
```python
# Stage 1: Simple moves
save_tasks(create_simple_tasks(), "my-org/2048-stage1")
# Stage 2: Strategic play
save_tasks(create_medium_tasks(), "my-org/2048-stage2")
# Stage 3: Expert level
save_tasks(create_hard_tasks(), "my-org/2048-stage3")
```
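The `create_*_tasks()` helpers above are placeholders; a hedged sketch of what stage 1 might look like, reusing the fields from the dataset example above:
```python
from hud.datasets import Task
from hud.types import MCPToolCall

def create_simple_tasks(n: int = 100) -> list[Task]:
    """Stage 1: low tile targets so an untrained policy can still earn reward."""
    return [
        Task(
            prompt="Reach the 64 tile in 2048",
            mcp_config={"hudpython/hud-text-2048:v1.2": {}},
            evaluate_tool=MCPToolCall(name="evaluate_max_tile", arguments={"target": 64}),
        )
        for _ in range(n)
    ]
```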
## Best Practices
* **Diversity**: Include varied prompts and scenarios
* **Balance**: Mix difficulty levels appropriately
* **Size**: Typically 100-1000 tasks per dataset
* **Validation**: Keep 10-20% of tasks for evaluation (see the split sketch below)
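For the validation point above, a minimal split sketch using the `tasks` list and `save_tasks` from the earlier example (dataset names are illustrative):
```python
import random

random.seed(0)
random.shuffle(tasks)

split = int(len(tasks) * 0.85)  # hold out ~15% for evaluation
save_tasks(tasks[:split], "my-org/2048-train")
save_tasks(tasks[split:], "my-org/2048-eval")
```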
## Example Datasets
* [hud-evals/2048-taskset](https://huggingface.co/datasets/hud-evals/2048-taskset) - 2048 game variations
* [hud-evals/sheetbench-50](https://huggingface.co/datasets/hud-evals/sheetbench-50) - Spreadsheet tasks
# Train Agents
Source: https://docs.hud.so/train-agents/index
Improve agent performance with reinforcement learning
Train agents to perform better on specific tasks using reinforcement learning. Small models trained on your tasks can outperform larger general models.
## Overview
HUD integrates with [Verifiers](https://github.com/willccbb/verifiers) for GRPO (Group Relative Policy Optimization) training:
```python
import verifiers as vf
# Load environment
env = vf.load_environment(
env_id="hud-vf-gym",
taskset="hud-evals/2048-taskset",
config_path="configs/2048.yaml"
)
# Train with GRPO
model, tokenizer = vf.get_model_and_tokenizer("Qwen/Qwen2.5-3B-Instruct")
trainer = vf.GRPOTrainer(
model=model,
env=env,
args=vf.grpo_defaults(),
peft_config=vf.lora_defaults() # LoRA for efficiency
)
trainer.train()
```
## Why Train?
* **Task Performance**: A 3B model can reach 80% success vs. a 35% baseline
* **Cost**: Smaller models are cheaper to run
* **Speed**: Trained models complete tasks faster
* **Consistency**: Less variance in behavior
## Requirements
| Model Size | GPU Memory         | Training Time (100 steps) |
| ---------- | ------------------ | ------------------------- |
| 3B         | 24GB (1x RTX 4090) | 30 minutes                |
| 7B         | 40GB (1x A100)     | 1 hour                    |
| 70B        | 320GB (4x A100)    | 6+ hours                  |
## Monitoring & Analytics
Track training progress in real-time:
* **[app.hud.so](https://app.hud.so)** - View traces, metrics, and rewards for each training generation
* **Weights & Biases** - Detailed ML metrics and loss curves
* **Local logs** - Training checkpoints and evaluation results
## Getting Started
* [RL Quickstart](/train-agents/rl-quickstart) - Train your first agent in 30 minutes
* [Config System](/train-agents/verifiers-gym) - Configure tools with YAML
* [Datasets](/train-agents/datasets) - Create training datasets
# RL Quickstart
Source: https://docs.hud.so/train-agents/rl-quickstart
Train agents with reinforcement learning using GRPO
Train agents to perform better on specific tasks using GRPO (Group Relative Policy Optimization).
## Quick Start
### Prerequisites
* GPU with 24GB+ VRAM (A100, RTX 4090)
* Python 3.12+
* HUD API key (set `HUD_API_KEY` environment variable)
### Installation
```bash
# Clone repository
git clone https://github.com/hud-evals/hud-python
cd hud-python/rl
# Clone hud-vf-gym (required)
git clone https://github.com/hud-evals/hud-vf-gym.git
# Install with uv (recommended)
uv sync
source .venv/bin/activate
```
### Train 2048 Agent
```bash
# Set API key
export HUD_API_KEY="YOUR_API_KEY"
# Build the 2048 environment
cd ../environments/text_2048/
docker build -t hud-text-2048 .
cd ../../rl/
```
```bash
# Terminal 1 - GPU 0
CUDA_VISIBLE_DEVICES=0 vf-vllm \
--model Qwen/Qwen2.5-3B-Instruct \
--enforce-eager \
--disable-log-requests
```
```bash
# Terminal 2 - GPU 1
CUDA_VISIBLE_DEVICES=1 python train_2048.py
```
The included `train_2048.py` script trains a 3B model on the 2048 game. Training takes \~30 minutes for 100 steps.

*Qwen-2.5-3B agent training on the [text-2048](https://github.com/hud-evals/hud-python/tree/main/environments/text_2048) environment using GRPO*
## How It Works
GRPO trains by comparing multiple attempts on the same task:
```mermaid
graph LR
A[Task] --> B[Generate 16 Attempts]
B --> C[Rank by Reward]
C --> D[Learn from Best]
D --> E[Improved Model]
```
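Conceptually, each attempt's reward is compared to the other attempts on the same task, and the policy is pushed toward above-average trajectories. A library-agnostic sketch of that group-relative signal (not the Verifiers implementation):
```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Score each attempt relative to its group: above-average attempts get a
    positive advantage, below-average attempts a negative one."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mean) / std for r in rewards]

# e.g. rewards from 4 of the 16 attempts at the same 2048 task
advantages = group_relative_advantages([0.1, 0.4, 0.9, 0.2])
```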
## Configuration
Training uses YAML configs to map agent tools to MCP tools:
```yaml
# From rl/configs/2048.yaml
system_prompt: |
You are an excellent 2048 player.
Available moves: left(), right(), up(), down(), done()
Return ONLY: move()
action_mappings:
left:
_tool: "move"
direction: {static: "left"}
right:
_tool: "move"
direction: {static: "right"}
# ... up, down similar
```
## Custom Training
```python
import verifiers as vf
# Load environment
env = vf.load_environment(
env_id="hud-vf-gym",
taskset="your-org/your-taskset", # HuggingFace dataset
config_path="configs/your-env.yaml", # In hud-vf-gym
num_tasks=10
)
# Configure training (matching train_2048.py)
model, tokenizer = vf.get_model_and_tokenizer("Qwen/Qwen2.5-3B-Instruct")
args = vf.grpo_defaults(run_name="my-training")
args.gradient_accumulation_steps = 2
args.per_device_train_batch_size = 8
args.num_generations = 16
args.max_steps = 100
args.learning_rate = 1e-6
# Train with LoRA
trainer = vf.GRPOTrainer(
model=model,
processing_class=tokenizer,
env=env,
args=args,
peft_config=vf.lora_defaults()
)
trainer.train()
```
## Monitoring
### HUD Dashboard
View training progress and traces at [app.hud.so](https://app.hud.so):
* Real-time screenshot and step viewing
* Tool call traces for each trajectory
* Reward distribution across generations
### Weights & Biases
For detailed ML metrics:
```python
args.report_to = "wandb"
os.environ["WANDB_API_KEY"] = "your-key"
```
## Resources
* [Example Training Script](https://github.com/hud-evals/hud-python/blob/main/rl/train_2048.py)
* [HUD VF Gym](https://github.com/hud-evals/hud-vf-gym) - Verifiers integration
# Config System
Source: https://docs.hud.so/train-agents/verifiers-gym
YAML-based tool configuration for training
Verifiers Gym uses YAML configs to map agent-facing tools to MCP tools, enabling flexible environment configuration without code changes.
## Basic Configuration
```yaml configs/browser.yaml
# Define agent interface
system_prompt: |
You are an AI assistant that can interact with a browser.
Available tools:
- screenshot(): Take a screenshot
- click(x, y): Click at coordinates
- type(text): Type text
- done(): Complete the task
Use function(args) format.
# Map agent tools to MCP tools
action_mappings:
screenshot:
_tool: "computer"
action: "screenshot"
click:
_parser:
positional: ["x", "y"] # Expected arguments
_tool: "computer"
action: "click"
x: {from_arg: "x"}
y: {from_arg: "y"}
type:
_parser:
positional: ["text"]
_tool: "computer"
action: "type"
text: {from_arg: "text"}
```
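Under this config, the agent emits plain text such as `click(100, 200)`; the parser pulls out the positional arguments and the mapping rewrites them into an MCP tool call. A hedged illustration of the resulting call:
```python
# Agent output (text):  click(100, 200)
# Resulting MCP tool call, conceptually:
mcp_call = {
    "name": "computer",       # from _tool
    "arguments": {
        "action": "click",    # static value in the mapping
        "x": 100,             # from_arg: "x"
        "y": 200,             # from_arg: "y"
    },
}
```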
## Advanced Features
### Argument Transformation
Transform arguments before passing to MCP:
```yaml
scroll:
_parser:
positional: ["direction", "amount"]
_tool: "computer"
action: "scroll"
coordinate: [640, 400] # Static value
direction:
from_arg: "direction"
amount:
from_arg: "amount"
transform: "lambda a: int(a * 120)" # Convert to pixels
```
### Context-Aware Transforms
Access other arguments in transforms:
```yaml
key:
_parser:
positional: ["key"]
_tool: "computer"
action: "press"
keys:
from_arg: "key"
transform: "lambda x: x.split('+')" # "ctrl+a" → ["ctrl", "a"]
```
### Static Values
Use fixed values instead of arguments:
```yaml
# 2048 game directions
left:
_tool: "move"
direction: {static: "left"}
right:
_tool: "move"
direction: {static: "right"}
```
## Parser Settings
Control agent output parsing:
```yaml
parser:
use_thinking: false # Disable tags
xml_weight: 0.6 # XML format validation weight
action_weight: 0.4 # Action syntax validation weight
rubric:
weights:
task_completion: 0.8 # Primary task reward
tool_execution: 0.1 # Tool success rate
format_compliance: 0.1 # Proper formatting
```
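The rubric weights combine per-component scores into the scalar reward used for training; a small sketch of that weighting (component scores are illustrative):
```python
# Component scores in [0, 1]; names match the rubric above
scores = {"task_completion": 1.0, "tool_execution": 0.8, "format_compliance": 1.0}
weights = {"task_completion": 0.8, "tool_execution": 0.1, "format_compliance": 0.1}

reward = sum(weights[k] * scores[k] for k in weights)  # 0.8 + 0.08 + 0.1 = 0.98
```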
## Usage
```python
import verifiers as vf
# Use your config (path relative to where you run the script)
env = vf.load_environment(
env_id="hud-vf-gym",
taskset="your-org/your-taskset",
config_path="configs/your-env.yaml"
)
```
## Examples
Config files are located in `hud-python/rl/configs/`:
* `2048.yaml` - 2048 game with directional moves
* `default.yaml` - Browser/computer interaction