# Adapting Software
Source: https://docs.hud.so/build-environments/docker-to-mcp
Patterns for wrapping existing applications using HUD SDK tools
Learn how to adapt existing software into MCP environments using patterns from the HUD SDK. All examples link to actual implementations in [hud-evals/hud-python](https://github.com/hud-evals/hud-python).
## Browser Automation Pattern
The HUD SDK provides powerful computer tools that can control any browser-based application. See how [remote\_browser](https://github.com/hud-evals/hud-python/tree/main/environments/remote_browser) implements this pattern.
### Using HUD Computer Tools
```python
from hud.tools.computer import HudComputerTool
from hud.tools.executors.base import BaseExecutor
# Browser executor wraps Playwright
class BrowserExecutor(BaseExecutor):
    def __init__(self, playwright_tool):
        self.browser = playwright_tool

    async def screenshot(self) -> str:
        # Return base64 screenshot from browser
        return await self.browser.screenshot()

    async def click(self, x: int, y: int):
        # Translate to browser coordinates
        await self.browser.click(x, y)

# Use with computer tool
executor = BrowserExecutor(playwright_tool)
computer_tool = HudComputerTool(executor=executor)
```
See the full implementation:
* [BrowserExecutor](https://github.com/hud-evals/hud-python/blob/main/environments/remote_browser/src/hud_controller/tools/executor.py) - Wraps Playwright with computer tool interface
* [Server integration](https://github.com/hud-evals/hud-python/blob/main/environments/remote_browser/src/hud_controller/server.py#L202-L205) - How tools are registered
### Multiple Provider Support
```python
# Support different cloud browsers
provider = os.getenv("BROWSER_PROVIDER", "anchorbrowser")
browser_provider = get_provider(provider)
# Each provider implements the same interface
cdp_url = await browser_provider.launch()
browser = await playwright.connect_over_cdp(cdp_url)
```
See [providers directory](https://github.com/hud-evals/hud-python/tree/main/environments/remote_browser/src/hud_controller/providers) for implementations.
## Game Environment Pattern
Simple stateful applications like games use custom tools. See [text\_2048](https://github.com/hud-evals/hud-python/tree/main/environments/text_2048) for the pattern.
### Custom Tool Implementation
```python
from hud.tools.base import BaseTool
from mcp.types import ContentBlock, TextContent
class MoveTool(BaseTool):
    """Custom tool for game moves."""

    def __init__(self, env):
        super().__init__(
            name="move",
            description="Move tiles in a direction",
            parameters={
                "type": "object",
                "properties": {
                    "direction": {"type": "string", "enum": ["up", "down", "left", "right"]}
                },
                "required": ["direction"]
            }
        )
        self.env = env

    async def __call__(self, direction: str) -> list[ContentBlock]:
        """Execute the move - this is what BaseTool calls."""
        # Perform the move
        self.env.move(direction)
        # Return visual state
        board_display = self.env.get_board_ascii()
        return [TextContent(text=board_display, type="text")]
```
See [MoveTool implementation](https://github.com/hud-evals/hud-python/blob/main/environments/text_2048/src/hud_controller/tools/move.py).
## Docker Base Images
HUD provides optimized base images for common use cases:
### Browser Automation
```dockerfile
FROM hudpython/browser-base:latest
# Includes Chromium, Playwright, and dependencies
# Add your automation code
COPY src/ /app/src/
CMD ["python", "-m", "hud_controller.server"]
```
### Desktop GUI
```dockerfile
FROM hudpython/novnc-base:latest
# Ubuntu with XFCE, VNC, and noVNC pre-configured
# Install your GUI application
RUN apt-get update && apt-get install -y \
    your-gui-app
# VNC is available on port 5901, noVNC on 8080
```
### Custom Base Images
When building from scratch:
```dockerfile
FROM python:3.11-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
    required-packages \
    && rm -rf /var/lib/apt/lists/*

# Python environment
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1
WORKDIR /app
```
See [base image examples](https://github.com/hud-evals/hud-python/tree/main/environments) in environment Dockerfiles.
## Environment Context Pattern
Share state between tools using environment context:
```python
# Define context class
class BrowserEnvironmentContext:
    def __init__(self, browser):
        self.browser = browser
        self.page = None

    async def goto(self, url: str):
        self.page = await self.browser.new_page()
        await self.page.goto(url)

# Share with tools in server.py
@mcp.initialize
async def initialize_environment(ctx):
    browser = await launch_browser()
    env = BrowserEnvironmentContext(browser)
    # Make env available to all tools
    global environment
    environment = env
```
## Common Return Types
All tools should return standardized types for consistency.
### ContentBlock for Tool Output
```python
from mcp.types import ContentBlock, TextContent, ImageContent
# Text output
return [TextContent(text="Operation complete", type="text")]
# Image output
return [ImageContent(data=base64_data, mimeType="image/png", type="image")]
# Multiple outputs
return [
    TextContent(text="Screenshot captured", type="text"),
    ImageContent(data=screenshot, mimeType="image/png", type="image")
]
```
### EvaluationResult for Scoring
```python
from hud.tools.types import EvaluationResult
return EvaluationResult(
    reward=0.8,   # Score between 0-1
    done=True,    # Task complete?
    info={        # Additional context
        "score": 80,
        "threshold": 75,
        "details": "Form submitted successfully"
    }
)
```
See [types.py](https://github.com/hud-evals/hud-python/blob/main/hud/tools/types.py) for full definitions.
## Tool Discovery and Resources
Expose available functions as MCP resources:
```python
@mcp.resource("setup://functions")
async def list_setup_functions():
    """List all available setup functions."""
    return {
        "functions": [
            {"name": name, "description": func.__doc__}
            for name, func in setup_functions.items()
        ]
    }
```
See [resource examples](https://github.com/hud-evals/hud-python/blob/main/environments/browser/src/hud_controller/server.py#L67-L90).
## Error Handling Patterns
Consistent error handling across tools:
```python
from mcp import McpError
from mcp.types import INTERNAL_ERROR
try:
    result = await risky_operation()
except SpecificError as e:
    # Return error as content
    return [TextContent(text=f"Operation failed: {e}", type="text")]
except Exception as e:
    # Fatal errors
    raise McpError(INTERNAL_ERROR, f"Unexpected error: {e}")
```
## Next Steps
* Study complete implementations
* Learn the hub pattern, telemetry, and metadata
For a step-by-step implementation guide, see the [environments README](https://github.com/hud-evals/hud-python/blob/main/environments/README.md).
# Build Environments
Source: https://docs.hud.so/build-environments/index
Wrap any software in MCP for agent evaluation
Turn any software that runs in Docker into an agent-testable environment. Whether it's a web app, desktop application, game, or API, HUD provides the tools to make it MCP-compatible.
## What Defines a HUD Environment?
A HUD environment is a Docker container that:
1. **Runs an MCP server** via stdio transport
2. **Implements lifecycle tools** (`setup` and `evaluate`)
3. **Exposes interaction tools** for agents to use
4. **Works identically** locally and remotely
The key is the Docker `CMD` entrypoint - it launches an MCP server that can be connected to by any MCP client, enabling both local development and remote scaling.
## Architecture & Lifecycle
```mermaid
graph LR
Agent[Agent] -->|MCP Protocol| Server[MCP Server]
Server --> Setup[Setup Tool]
Server --> Interact[Interaction Tools]
Server --> Evaluate[Evaluate Tool]
Setup -->|Initialize State| Environment
Interact -->|Control| Environment
Evaluate -->|Score Performance| Environment
```
### Lifecycle Flow
1. **Initialize**: MCP handshake establishes connection
2. **Setup**: Prepare environment to known state (hidden from agent)
3. **Interact**: Agent uses tools to complete task
4. **Evaluate**: Score agent's performance (hidden from agent)
## Environment Specification
### Required Components
Every HUD environment must have:
```python
from hud.server import MCPServer

mcp = MCPServer(name="my-environment")

# 1. Setup tool (lifecycle - hidden from agents)
@mcp.tool()
async def setup(**kwargs) -> dict:
    """Initialize environment state"""
    # Reset to known state
    # Load test data
    # Configure settings
    return {"status": "ready"}

# 2. Evaluate tool (lifecycle - hidden from agents)
@mcp.tool()
async def evaluate(**kwargs) -> dict:
    """Score agent performance"""
    # Check task completion
    # Calculate reward (0-1)
    # Determine if done
    return {
        "reward": 0.8,
        "done": True,
        "info": {"details": "..."}
    }

# 3. Interaction tools (visible to agents)
@mcp.tool()
async def perform_action(param: str) -> dict:
    """Tool agents use to interact"""
    # Modify environment state
    # Return feedback
    return {"result": "success"}
```
### Standard File Structure
```
my-environment/
├── Dockerfile # Container definition
├── pyproject.toml # Python package config
├── README.md # Environment documentation
└── src/
└── hud_controller/ # Standard package name
├── __init__.py
├── server.py # MCP server entry point
├── context.py # Environment state (optional)
├── setup/ # Setup functions (if using hub)
├── evaluate/ # Evaluation functions (if using hub)
└── tools/ # Custom interaction tools
```
### Docker Requirements
The Dockerfile must:
* Install dependencies
* Copy source code
* Set proper working directory
* **Define CMD that runs the MCP server**
```dockerfile
FROM python:3.11-slim
WORKDIR /app
# Copy and install
COPY pyproject.toml ./
COPY src/ ./src/
RUN pip install -e .
# Critical: Launch MCP server on stdio
CMD ["python", "-m", "hud_controller.server"]
```
This `CMD` is what enables the same image to work both locally and remotely:
* Locally: `docker run -i` pipes stdio
* Remotely: HUD platform manages stdio transport
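For example, the same image can be referenced from either transport. A minimal sketch (the image name `my-environment:latest` is a placeholder):

```python
import os

# Local development: the MCP client spawns the container and speaks MCP over stdio
local_config = {
    "local": {
        "command": "docker",
        "args": ["run", "--rm", "-i", "my-environment:latest"],
    }
}

# Remote scaling: the HUD platform runs the same image and proxies stdio
remote_config = {
    "hud": {
        "url": "https://mcp.hud.so/v3/mcp",
        "headers": {
            "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
            "Mcp-Image": "my-environment:latest",
        },
    }
}
```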
## MCPServer vs FastMCP
The HUD SDK provides `MCPServer`, which extends FastMCP with Docker-friendly features:
### What MCPServer Adds
1. **SIGTERM Handling**: Graceful shutdown in containers
```python
mcp = MCPServer(name="my-env")
mcp.run() # Automatically handles SIGTERM
```
2. **Initialize Decorator**: Async setup during MCP handshake
```python
@mcp.initialize
async def initialize_environment(ctx):
    # ctx.session: MCP ServerSession
    # ctx.meta: Metadata from client
    # ctx.meta.progressToken: For progress updates
    await setup_resources()
```
3. **Shutdown Decorator**: Cleanup after lifespan
```python
@mcp.shutdown
async def cleanup():
    await close_connections()
```
4. **BaseTool Support**: Enhanced tool registration
```python
from hud.tools.base import BaseTool
custom_tool = MyTool() # Inherits from BaseTool
mcp.add_tool(custom_tool) # Uses tool.mcp automatically
```
## Common MCP Types
### Tool Outputs
All tools should return standardized types:
```python
from mcp.types import ContentBlock, TextContent, ImageContent
# Text output
return [TextContent(text="Operation complete", type="text")]
# Image output
return [ImageContent(
    data=base64_screenshot,
    mimeType="image/png",
    type="image"
)]

# Multiple outputs
return [
    TextContent(text="Status: Ready", type="text"),
    ImageContent(data=screenshot, mimeType="image/png", type="image")
]
```
### Evaluation Results
```python
from hud.tools.types import EvaluationResult
return EvaluationResult(
    reward=0.8,         # Score between 0-1
    done=True,          # Task complete?
    content="Details",  # Optional description
    info={              # Additional metadata
        "score": 80,
        "threshold": 75
    }
)
```
## Quick Example
Here's a minimal environment:
```python
from hud.server import MCPServer
from mcp.types import TextContent
mcp = MCPServer(name="counter-env")
# State
counter = 0
@mcp.tool()
async def setup(start_value: int = 0):
"""Reset counter to starting value"""
global counter
counter = start_value
return {"status": "ready", "counter": counter}
@mcp.tool()
async def increment():
"""Increment the counter"""
global counter
counter += 1
return [TextContent(text=f"Counter: {counter}", type="text")]
@mcp.tool()
async def evaluate(target: int):
"""Check if counter reached target"""
global counter
return {
"reward": 1.0 if counter >= target else counter / target,
"done": counter >= target,
"info": {"counter": counter, "target": target}
}
if __name__ == "__main__":
mcp.run()
```
## Getting Started
* Build your first environment step-by-step
* Patterns for wrapping existing applications
* Hub pattern, telemetry, and production tips
## Reference Implementations
Study these examples:
### text\_2048 (Simple)
* ASCII game with clear state
* Custom move tool
* Simple evaluation logic
### remote\_browser (Complex)
* Multi-provider browser automation
* Hub pattern for dozens of functions
* Production telemetry
## Key Concepts
### Tool Categories
1. **Setup Tools**: Initialize state (hidden from agents)
2. **Evaluate Tools**: Score performance (hidden from agents)
3. **Interaction Tools**: What agents use to control the environment
### MCP Communication
* **stdout**: Reserved for MCP protocol (JSON-RPC)
* **stderr**: All logs and debug output
* **stdio transport**: How containers communicate
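Because stdout carries the protocol itself, any stray print would corrupt the JSON-RPC stream. A minimal sketch of routing all logging to stderr:

```python
import logging
import sys

# Send every log record to stderr so stdout stays reserved for MCP JSON-RPC.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
```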
### Process Separation for Hot-Reload Development
**Critical Design Principle**: For environments that need to maintain state during development (like browsers, databases, or running applications), you must separate the MCP server process from the actual environment process. This separation enables hot-reload development without losing state.
The MCP server should be a lightweight process that:
1. Starts quickly and can restart without side effects
2. Connects to your environment process on initialization
3. Acts as a bridge between MCP protocol and your environment
The environment process should:
1. Run as a separate, persistent service
2. Maintain state across MCP server restarts
3. Accept connections from the MCP server
#### Communication Methods
You can connect the MCP server to your environment using:
* **Unix Sockets** (recommended for local development)
```python
# In your MCP server
from hud.server.context import attach_context
ctx = attach_context("/tmp/my_env.sock")
```
* **TCP/HTTP API**
```python
# Connect to HTTP API
async with httpx.AsyncClient() as client:
    response = await client.get("http://localhost:8080/state")
```
* **gRPC**
```python
# Connect via gRPC
channel = grpc.aio.insecure_channel('localhost:50051')
stub = MyServiceStub(channel)
```
* **Shared Memory/IPC**
```python
# Using multiprocessing managers
from multiprocessing.managers import BaseManager
manager.connect()
```
#### Example: Browser Environment Pattern
The browser environment demonstrates this pattern:
1. **Context Server** (`context_server.py`): Persistent process managing browser state
2. **MCP Server** (`server.py`): Lightweight server that restarts on code changes
3. **Connection**: Unix socket at `/tmp/hud_browser_ctx.sock`
```python
# context_server.py - Runs once, maintains state
class PersistentBrowserContext:
    def __init__(self):
        self.browser = None
        self.apps = {}

    def startup(self):
        # Initialize browser, X11, VNC once
        self.browser = launch_browser()

# server.py - Can restart anytime
@mcp.initialize
async def initialize_environment(ctx):
    # Connect to persistent context
    persistent_ctx = attach_context("/tmp/hud_browser_ctx.sock")
    # Use existing browser state
    browser = persistent_ctx.get_browser()
```
This separation allows `hud dev` to restart the MCP server for code changes while keeping your browser, database, or application running.
### Testing Your Environment
#### Traditional Testing
```bash
# Debug all phases
hud debug my-environment:latest
# Analyze available tools
hud analyze my-environment:latest
```
#### Hot-Reload Development
```bash
# Start development server with live reload
hud dev --build
# Edit src/ files - changes are immediately reflected
# Test in another terminal
hud analyze hud-my-environment:dev
```
The `hud dev` command:
* Mounts your `src/` directory into containers
* Auto-detects or builds Docker images
* Provides HTTP endpoint for Cursor integration
* Eliminates rebuild cycles during development
* **Restarts only the MCP server**, not your environment state
For stateful environments (browsers, databases, games), implement process separation so that `hud dev` hot-reload doesn't reset your environment state. See the browser environment for a complete example.
## SDK Reference
* Complete API documentation for MCPServer
* Available tools and executors
## Next Steps
For the complete step-by-step guide with all implementation details, see the [environments README](https://github.com/hud-evals/hud-python/blob/main/environments/README.md).
# TODO App Tutorial
Source: https://docs.hud.so/build-environments/quickstart-todo
Build your first MCP environment in 15 minutes
Let's build a TODO app environment from scratch. This hands-on tutorial will teach you the core concepts of MCP environment development.
## What We're Building
A simple TODO list that agents can interact with:
* Add, complete, and delete tasks
* List all tasks with their status
* Evaluate based on task completion
This tutorial assumes basic Python knowledge. Full code is available at `environments/todo-example`
## Step 1: Project Structure
Create a new directory for your environment:
```bash
mkdir todo-env
cd todo-env
```
Create the following structure:
```
todo-env/
├── Dockerfile
├── pyproject.toml
├── README.md
└── src/
└── hud_controller/
├── __init__.py
├── server.py
└── todo_app.py
```
## Step 2: The TODO Application
First, let's build the core TODO logic:
```python src/hud_controller/todo_app.py
from dataclasses import dataclass
from typing import List, Optional
from datetime import datetime

@dataclass
class TodoItem:
    id: int
    title: str
    completed: bool = False
    created_at: datetime = None

    def __post_init__(self):
        if self.created_at is None:
            self.created_at = datetime.now()

class TodoApp:
    def __init__(self):
        self.todos: List[TodoItem] = []
        self.next_id = 1

    def add_todo(self, title: str) -> TodoItem:
        """Add a new TODO item"""
        todo = TodoItem(id=self.next_id, title=title)
        self.todos.append(todo)
        self.next_id += 1
        return todo

    def complete_todo(self, todo_id: int) -> bool:
        """Mark a TODO as completed"""
        for todo in self.todos:
            if todo.id == todo_id:
                todo.completed = True
                return True
        return False

    def delete_todo(self, todo_id: int) -> bool:
        """Delete a TODO item"""
        for i, todo in enumerate(self.todos):
            if todo.id == todo_id:
                self.todos.pop(i)
                return True
        return False

    def list_todos(self) -> List[dict]:
        """Get all TODOs as dictionaries"""
        return [
            {
                "id": todo.id,
                "title": todo.title,
                "completed": todo.completed,
                "created_at": todo.created_at.isoformat()
            }
            for todo in self.todos
        ]

    def clear_all(self):
        """Reset the TODO list"""
        self.todos = []
        self.next_id = 1
```
## Step 3: MCP Server Integration
Now wrap the TODO app with MCP:
```python src/hud_controller/server.py
import sys
import logging

from hud.server import MCPServer
from .todo_app import TodoApp

# Configure logging to stderr
logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format='[%(levelname)s] %(asctime)s | %(name)s | %(message)s'
)

# Create server and app instances
mcp = MCPServer(name="TODO Environment")
app = TodoApp()

# Interaction tools (visible to agents)
@mcp.tool()
async def add_todo(title: str) -> dict:
    """Add a new TODO item to the list"""
    todo = app.add_todo(title)
    return {
        "success": True,
        "todo": {
            "id": todo.id,
            "title": todo.title
        },
        "message": f"Added TODO: {title}"
    }

@mcp.tool()
async def complete_todo(todo_id: int) -> dict:
    """Mark a TODO item as completed"""
    success = app.complete_todo(todo_id)
    return {
        "success": success,
        "message": f"{'Completed' if success else 'Failed to complete'} TODO {todo_id}"
    }

@mcp.tool()
async def list_todos() -> dict:
    """List all TODO items"""
    todos = app.list_todos()
    return {
        "todos": todos,
        "total": len(todos),
        "completed": sum(1 for t in todos if t["completed"])
    }

@mcp.tool()
async def delete_todo(todo_id: int) -> dict:
    """Delete a TODO item"""
    success = app.delete_todo(todo_id)
    return {
        "success": success,
        "message": f"{'Deleted' if success else 'Failed to delete'} TODO {todo_id}"
    }

# Setup tool (hidden from agents)
@mcp.tool()
async def setup(initial_todos: list[str] = None) -> dict:
    """Initialize TODO list with optional items"""
    app.clear_all()
    if initial_todos:
        for title in initial_todos:
            app.add_todo(title)
    return {
        "status": "success",
        "message": f"Initialized with {len(initial_todos or [])} TODOs"
    }

# Evaluation tool (hidden from agents)
@mcp.tool()
async def evaluate(target_completed: int = None) -> dict:
    """Evaluate based on TODO completion"""
    todos = app.list_todos()
    total = len(todos)
    completed = sum(1 for t in todos if t["completed"])

    if target_completed is not None:
        # Specific target
        reward = 1.0 if completed >= target_completed else completed / target_completed
    else:
        # General completion rate
        reward = completed / total if total > 0 else 0.0

    return {
        "reward": reward,
        "done": reward >= 1.0,
        "info": {
            "completed": completed,
            "total": total,
            "details": f"{completed}/{total} TODOs completed"
        }
    }

# Initialize handler
@mcp.initialize
async def on_initialize(ctx):
    """Called when MCP connection is established"""
    logging.info("TODO environment ready!")

if __name__ == "__main__":
    mcp.run()
```
## Step 4: Package Configuration
Create a `pyproject.toml` file:
```toml pyproject.toml
[project]
name = "todo-env"
version = "0.1.0"
description = "TODO app MCP environment"
requires-python = ">=3.11"
dependencies = [
    "hud-python>=0.5.0",
]
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"
```
## Step 5: Docker Configuration
Create the Dockerfile:
```dockerfile Dockerfile
FROM python:3.11-slim
# Prevent Python from buffering output
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1
WORKDIR /app
# Copy package files
COPY pyproject.toml ./
COPY src/ ./src/
# Install in editable mode for development
RUN pip install --no-cache-dir -e .
# Run as a module
CMD ["python", "-m", "hud_controller.server"]
```
## Step 6: Build and Test
### Option 1: Traditional Build & Test
Build your Docker image:
```bash
docker build -t todo-env .
```
Test with HUD CLI:
```bash
# Run the 5-phase debug check
hud debug todo-env
# If that passes, analyze the tools
hud analyze todo-env
```
### Option 2: Hot-Reload Development (Recommended)
Use the development proxy for instant feedback:
```bash
# Start development server with hot-reload
hud dev --build
# In another terminal, test your environment
hud analyze hud-todo-env:dev
```
Now you can edit `src/hud_controller/server.py` and see changes instantly without rebuilding!
Your environment should pass all 5 debug phases and show the available tools
## Step 7: Create Tasks
Now create tasks for agents to solve:
```python
from hud.datasets import Task

tasks = [
    Task(
        prompt="Add three TODOs: buy groceries, call mom, finish report",
        mcp_config={
            "local": {
                "command": "docker",
                "args": ["run", "--rm", "-i", "todo-env:latest"]
            }
        },
        setup_tool={"name": "setup"},
        evaluate_tool={
            "name": "evaluate",
            "arguments": {"target_completed": 3}
        }
    ),
    Task(
        prompt="Complete all TODOs that contain the word 'urgent'",
        mcp_config={
            "local": {
                "command": "docker",
                "args": ["run", "--rm", "-i", "todo-env:latest"]
            }
        },
        setup_tool={
            "name": "setup",
            "arguments": {
                "initial_todos": [
                    "urgent: fix bug",
                    "buy coffee",
                    "urgent: submit report"
                ]
            }
        },
        evaluate_tool={
            "name": "evaluate",
            "arguments": {"target_completed": 2}
        }
    )
]
```
## Step 8: Test with an Agent
Run your first evaluation:
```python
import asyncio
import hud
from hud.agents import ClaudeAgent

async def test_todo_env():
    task = tasks[0]  # First task

    with hud.trace("todo-test"):
        agent = ClaudeAgent()
        result = await agent.run(task)

        print(f"Success: {result.reward == 1.0}")
        print(f"Details: {result.info}")

asyncio.run(test_todo_env())
```
## Publishing Your Environment
Once tested, publish to Docker Hub:
```bash
# Tag your image
docker tag todo-env your-dockerhub/todo-env:v1.0
# Push to registry
docker push your-dockerhub/todo-env:v1.0
```
### Using Your Environment Remotely
Others can now use your environment at scale:
```python
import os
from hud.datasets import Task
from hud.agents import ClaudeAgent

# Remote MCP configuration
task = Task(
    prompt="Create 3 todos and complete the urgent ones",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                "Mcp-Image": "your-dockerhub/todo-env:v1.0"
            }
        }
    },
    setup_tool={"name": "setup"},
    evaluate_tool={"name": "evaluate", "arguments": {"target_completed": 2}}
)

# Run with any agent
agent = ClaudeAgent()
result = await agent.run(task)
```
## Next Steps
This tutorial covered the basics. For production environments, explore:
* Full 5-step process with advanced patterns
* Hub pattern, telemetry, and production tips
The official [environments README](https://github.com/hud-evals/hud-python/blob/main/environments/README.md) contains comprehensive implementation details, troubleshooting tips, and reference examples.
# Advanced Design
Source: https://docs.hud.so/build-environments/registry
Production patterns for MCP environments
Learn advanced patterns for building production-ready MCP environments. These techniques are used in environments like `remote_browser` and enable sophisticated agent interactions.
## The Hub Pattern
The hub pattern uses BaseHub, which is actually an inner MCP server that can be mounted to your main server. This provides namespace separation and modular organization.
### How Hubs Work
A hub is an MCP server that exposes a single tool. When you call this tool, you specify which function within the hub to execute:
```json
{
  "name": "setup",
  "arguments": {
    "name": "navigate_to_page",
    "arguments": {
      "url": "https://example.com"
    }
  }
}
```
The outer `name` is the hub tool, inner `name` is the specific function, and inner `arguments` are passed to that function.
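From the client side this is an ordinary tool call on the hub tool. A sketch, assuming an already-initialized `MCPClient` instance named `client` as shown in the Architecture docs:

```python
# Call the "setup" hub tool and route to its "navigate_to_page" function
result = await client.call_tool(
    "setup",
    {
        "name": "navigate_to_page",
        "arguments": {"url": "https://example.com"},
    },
)
```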
### Implementation
```python
# setup/__init__.py
from hud.tools.base import BaseHub
setup = BaseHub("setup")
# Import all setup modules
from . import navigation, data, auth
__all__ = ["setup"]
```
```python
# setup/navigation.py
from . import setup
@setup.tool()
async def navigate_to_page(url: str):
    """Navigate browser to specified URL."""
    env = setup.env  # Access shared environment
    await env.browser.goto(url)
    return {"status": "success", "url": url}

@setup.tool()
async def reset_browser():
    """Clear cookies and return to blank page."""
    env = setup.env
    await env.browser.clear_cookies()
    await env.browser.goto("about:blank")
```
```python
# server.py
from .setup import setup as setup_hub
from .evaluate import evaluate as evaluate_hub

mcp = MCPServer(name="my-environment")

@mcp.initialize
async def initialize_environment(ctx):
    # Create environment context
    env = await create_environment()

    # Set environment on hubs
    setup_hub.env = env
    evaluate_hub.env = env

    # Mount hubs to server
    mcp.mount(setup_hub)
    mcp.mount(evaluate_hub)
```
Benefits:
* **Modular organization**: Each function in its own file
* **Namespace separation**: Tools grouped logically
* **Shared state**: Access environment via `hub.env`
* **Auto-discovery**: Import registers all decorated functions
## Progress Notifications
For long-running initialization, provide real-time feedback:
```python
@mcp.initialize
async def initialize_environment(ctx):
    """Initialize with progress updates."""
    # Extract progress token from context
    progress_token = getattr(ctx.meta, "progressToken", None) if ctx.meta else None

    async def send_progress(progress: int, message: str):
        if progress_token:
            await ctx.session.send_progress_notification(
                progress_token=progress_token,
                progress=progress,
                total=100,
                message=message
            )

    await send_progress(10, "Starting services...")
    await start_x11_server()

    await send_progress(40, "Launching browser...")
    browser = await launch_browser()

    await send_progress(70, "Connecting to cloud provider...")
    await connect_provider()

    await send_progress(100, "Environment ready!")
```
This shows live in the HUD platform during container startup.
## Dynamic Configuration via Agent Metadata
Agents can pass metadata that gets injected into the MCP server's initialization context using `patch_mcp_config`:
```python
# In your agent
from hud.agents import ClaudeAgent

agent = ClaudeAgent(
    metadata={
        "display_width": 1280,
        "display_height": 720,
        "browser_provider": "anchorbrowser"
    }
)

# This metadata is patched into mcp_config via patch_mcp_config
# See hud/agents/base.py#L476-L479
```
### How It Works
The agent's `_setup_config` method uses `patch_mcp_config` to inject metadata:
```python
from hud.utils.mcp import MCPConfigPatch, patch_mcp_config

async def _setup_config(self, mcp_config: dict[str, dict[str, Any]]) -> None:
    """Inject metadata into the metadata of the initialize request."""
    if self.metadata:
        patch_mcp_config(
            mcp_config,
            MCPConfigPatch(meta=self.metadata),
        )
```
### Using Metadata in Your Environment
```python
@mcp.initialize
async def initialize_environment(ctx):
    # Access metadata from context
    metadata = ctx.meta

    # Adjust display resolution
    width = metadata.get("display_width", 1920)
    height = metadata.get("display_height", 1080)

    # Select browser provider
    provider = metadata.get("browser_provider", "chromium")

    # Configure performance
    headless = metadata.get("headless", False)

    # Apply configuration
    browser_config = {
        "viewport": {"width": width, "height": height},
        "headless": headless,
        "args": ["--disable-gpu"] if headless else []
    }
    browser = await launch_browser(browser_config)
```
See [patch\_mcp\_config implementation](https://github.com/hud-evals/hud-python/blob/main/hud/utils/mcp.py#L17-L35) for details.
## Live Telemetry
Expose real-time data via resources:
```python
@mcp.resource("telemetry://live")
async def get_telemetry():
    """Provide live environment status."""
    return {
        "status": "running",
        "live_url": get_live_view_url(),  # VNC or browser URL
        "metrics": {
            "memory_usage": get_memory_usage(),
            "active_connections": count_connections(),
            "uptime": get_uptime()
        },
        "logs": get_recent_logs(lines=50)
    }
```
This appears in the HUD platform UI for debugging.
## VNC Server Integration
For GUI environments, provide visual access:
```dockerfile
# Start from base image with VNC
FROM hudpython/novnc-base:latest

# Your GUI app installation
RUN apt-get update && apt-get install -y your-gui-app

# VNC is already configured on ports 5901 (VNC) and 8080 (noVNC)
```
In your MCP server:
```python
@mcp.resource("telemetry://live")
async def get_telemetry():
    # Provide noVNC URL for browser viewing
    return {
        "live_url": "http://localhost:8080/vnc.html",
        "vnc_port": 5901,
        "password": "vncpass"  # Set via VNC_PW env var
    }
```
## Using Base Images
HUD provides optimized base images:
### hudpython/novnc-base
* Ubuntu with XFCE desktop
* VNC and noVNC pre-configured
* Common utilities installed
```dockerfile
FROM hudpython/novnc-base:latest
# Add your GUI application
RUN apt-get update && apt-get install -y \
    libreoffice \
    firefox \
    your-app
# VNC auto-starts with your app
```
### hudpython/browser-base
* Chromium and Playwright installed
* Optimized for browser automation
```dockerfile
FROM hudpython/browser-base:latest
# Add your browser automation code
COPY src/ /app/src/
CMD ["python", "-m", "src.server"]
```
## Phase 6: Automated Improvement
Enable Cursor Agent to improve your environment:
### 1. Development Configuration
Use [reloaderoo](https://github.com/cameroncooke/reloaderoo) for hot-reload development:
```json
// .cursor/mcp.json
{
  "mcpServers": {
    "my-env-dev": {
      "command": "npx",
      "args": [
        "reloaderoo", "--",  // Enables hot-reload without restarting client
        "docker", "run", "-i", "--rm",
        "-v", "$(pwd)/src:/app/src:rw",
        "-e", "PYTHONDEBUG=1",
        "my-env:dev"
      ]
    }
  }
}
```
### 2. Cursor Rules
Create `.cursor/rules/mcp_environment_iteration.mdc`:
```markdown
---
description: Improve an MCP environment
alwaysApply: false
---
Setup
1. Ensure dev mode MCP config exists
2. Explore environment structure
3. Clarify improvement objectives
Iteration
1. Test current functionality via tools
2. Note errors and improvements in TODOs
3. Fix issues in source code
4. Call restart_server to reload
5. Verify fixes work correctly
6. Continue until all TODOs complete
```
### 3. Improvement Workflow
Prompt Cursor:
```
Context: I have a browser environment in my-env folder
Tools: Use my-env-dev to interact with it
Task: Test all evaluators and fix any that fail
```
The agent will:
1. Explore available tools
2. Test each function
3. Fix issues in your code
4. Reload and verify
5. Continue iterating
## Example: Production Browser Environment
Here's how `remote_browser` implements these patterns:
```text
# Modular structure
src/hud_controller/
├── server.py # Main MCP server with progress
├── providers/ # Cloud browser integrations
├── setup/ # Hub with navigation, cookies
├── evaluate/ # Hub with page checks
└── tools/ # Playwright and computer tools
# Advanced features
- Progress notifications during startup
- Live browser URL in telemetry
- Dynamic resolution from metadata
- Multiple provider support
- Automatic cleanup on shutdown
```
## Next Steps
* Complete guide with all implementation details
* Study production environments
# Architecture
Source: https://docs.hud.so/core-concepts/architecture
How HUD connects agents, environments, and evaluation
HUD is built as a thin orchestration layer on top of MCP, providing structure for agent evaluation, environment management, and telemetry.
## System Overview
```mermaid
graph TB
subgraph "Remote"
Dashboard["📊 Dashboard
(app.hud.so)"]
API["🔌 MCP Orchestrator
(mcp.hud.so)"]
end
subgraph "Your Code"
Agent["🤖 Agent
(Claude/Operator)"]
Task["📋 Task
(Prompt + Evaluation)"]
SDK["📦 hud-python"]
end
subgraph "Environments"
LocalEnv["🖥️ Local Docker
(Development)"]
RemoteEnv["☁️ Remote Docker
(100s Parallel)"]
end
subgraph "OpenTelemetry"
Trace["📡 Traces & Metrics"]
end
Dataset["📚 Dataset
(HuggingFace)"]
AnyMCP["🔗 Any MCP Client
(Cursor, Claude, Custom)"]
Agent <--> SDK
Task --> SDK
Dataset <-.-> Task
SDK <-->|"MCP"| LocalEnv
SDK <-->|"MCP"| API
API <-->|"MCP"| RemoteEnv
SDK -->|"hud.trace"| Trace
Trace --> Dashboard
AnyMCP -->|"MCP"| API
style Dashboard fill:#e0e7ff,stroke:#6366f1,stroke-width:2px
style SDK fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style RemoteEnv fill:#d1fae5,stroke:#10b981,stroke-width:2px
style AnyMCP fill:#fce7f3,stroke:#ec4899,stroke-width:2px,stroke-dasharray: 5 5
```
## Core Components
### 1. Agents (`hud.agents`)
Agents make decisions and call tools:
```python
from hud.agents import ClaudeAgent, OperatorAgent

# Built-in agents
agent = ClaudeAgent()    # Uses Anthropic's Claude
agent = OperatorAgent()  # Uses OpenAI Operator

# All agents inherit from MCPAgent (interface sketch)
class MCPAgent:
    async def initialize(self, task: str | Task | None): ...
    async def run(self, prompt_or_task: str | Task, max_steps: int = 10) -> Trace: ...
    async def call_tools(self, tool_calls: list[MCPToolCall]) -> list[MCPToolResult]: ...
```
Agents can auto-create MCP clients from `task.mcp_config` - no manual client setup needed
### 2. Tasks (`hud.Task`)
Tasks define what agents should accomplish:
```python
from hud.datasets import Task
task = Task(
    prompt="Complete 3 items in the TODO list",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": f"Bearer {api_key}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },
    # Tool names and arguments map directly to MCP server tools
    setup_tool={
        "name": "setup",  # Calls def setup(name, num_items) on the environment
        "arguments": {
            "name": "todo_seed",
            "num_items": 5
        }
    },
    evaluate_tool={
        "name": "evaluate",  # Calls def evaluate(name, expected_count) on the environment
        "arguments": {
            "name": "todo_completed",
            "expected_count": 3
        }
    }
)
```
The `name` and `arguments` in setup/evaluate tools correspond exactly to the tool names and parameters exposed by the MCP server
### 3. MCP Clients (`hud.clients`)
Clients handle the MCP protocol:
```python
from hud.clients import MCPClient
# Auto-created by agents, or manual:
client = MCPClient(mcp_config)
await client.initialize()
tools = await client.list_tools()
```
### 4. Environments
Environments are MCP servers exposing tools:
```python
from hud.server import MCPServer

server = MCPServer("my-env")

@server.tool()
def move(direction: str):
    # Environment logic
    return result
```
### 5. Telemetry (`hud.trace`)
Real-time observability:
```python
import hud

with hud.trace("my-evaluation"):
    result = await agent.run(task)

# View at app.hud.so/traces/{trace_id}
```
## Execution Flow
1. Create a `Task` with prompt and MCP configuration
2. Agent creates MCP client (if needed) and connects to environment
3. Execute `setup_tool` to initialize environment state
4. Agent receives observations, makes decisions, calls tools
5. Execute `evaluate_tool` to score performance
6. All interactions streamed to HUD backend for analysis
## Key Design Principles
1. **Protocol-First**: Everything speaks MCP
2. **Composable**: Mix and match agents, environments, evaluations
3. **Observable**: Built-in telemetry for every interaction
4. **Testable**: Reproducible evaluations with Docker
5. **Extensible**: Easy to add new agents or environments
The `MCPServer` class wraps FastMCP with lifecycle management, making it easy to build Docker-based environments
## Next Steps
* Test and benchmark agents on standardized tasks
* Create MCP-compatible environments for your software
# MCP Protocol
Source: https://docs.hud.so/core-concepts/mcp-protocol
Understanding the Model Context Protocol that connects agents to environments
The Model Context Protocol (MCP) is the backbone of HUD: it provides a standard way for AI agents to interact with any software environment through tool calls.
## Why MCP?
Traditional agent frameworks couple agents tightly to specific environments. MCP decouples them:
**Without MCP:**
* Agent code hardcoded for each environment
* No standardization across tools
* Difficult to swap agents or environments

**With MCP:**
* Any agent works with any environment
* Standard protocol for all interactions
* Easy to swap components
## How It Works
MCP standardizes agent-environment communication through JSON-RPC messages. Agents call tools exposed by environments and receive structured responses.
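For example, a tool call and its response are plain JSON-RPC messages. A simplified sketch of the wire format (the `move` tool is illustrative):

```json
// Request: the agent calls the "move" tool
{"jsonrpc": "2.0", "id": 1, "method": "tools/call",
 "params": {"name": "move", "arguments": {"direction": "up"}}}

// Response: the environment returns structured content
{"jsonrpc": "2.0", "id": 1,
 "result": {"content": [{"type": "text", "text": "Moved up."}]}}
```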
## Core Concepts
### Tools
[Tools](https://modelcontextprotocol.io/specification/server/tools) are functions exposed by the environment:
```json
{
  "name": "move",
  "description": "Move in a direction",
  "inputSchema": {
    "type": "object",
    "properties": {
      "direction": {
        "type": "string",
        "enum": ["up", "down", "left", "right"]
      }
    }
  }
}
```
### Tool Calls & Results
Agents [call tools](https://modelcontextprotocol.io/specification/server/tools#calling-tools) and receive results:
```python
# Agent makes a tool call
result = await client.call_tool("move", {"direction": "up"})

# Environment returns result
{
    "content": [{
        "type": "text",
        "text": "Moved up. New board state: ..."
    }]
}
```
### Lifecycle Management
MCP defines a [rigorous lifecycle](https://modelcontextprotocol.io/specification/basic/lifecycle) for connections:
1. **Initialization**: Client and server negotiate capabilities and protocol version with `client.initialize()`
2. **Operation**: Normal tool calling and message exchange
3. **Shutdown**: Clean termination of the connection
The protocol ensures both sides understand each other's capabilities before proceeding.
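A sketch of that lifecycle using the SDK's `MCPClient` (assuming an `mcp_config` like the ones below; the cleanup method name is an assumption, so check the SDK for the exact API):

```python
from hud.clients import MCPClient

client = MCPClient(mcp_config)

# 1. Initialization: negotiate capabilities and protocol version
await client.initialize()

# 2. Operation: normal tool calling and message exchange
tools = await client.list_tools()
result = await client.call_tool("move", {"direction": "up"})

# 3. Shutdown: clean termination of the connection
#    (method name is an assumption; consult the SDK reference)
await client.shutdown()
```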
## HUD's MCP Extensions
HUD adds conventions on top of MCP:
1. **Setup Tools**: Initialize environment state (`setup_board`, `navigate_to_url`)
2. **Evaluate Tools**: Score agent performance (`evaluate_max_tile`, `contains_text`)
3. **Lifecycle Management**: Clean initialization and shutdown with `client.initialize()` and proper cleanup
See the [Tools Reference](/reference/tools) for implementation details.
## Transport Options
HUD environments are designed for 100% reproducibility through Docker:
Run environments locally for development and debugging using [stdio transport](https://modelcontextprotocol.io/specification/basic/transports#stdio):
```python
mcp_config = {
    "hud-text-2048": {
        "command": "docker",
        "args": ["run", "--rm", "-i", "hudpython/hud-text-2048:v1.2"]
    }
}
```
* **Transport**: [stdio](https://modelcontextprotocol.io/specification/basic/transports#stdio) - JSON-RPC over stdin/stdout
* **Pros**: Full control, easy debugging, no network latency
* **Use case**: Development, testing, single-agent runs
The `-i` flag enables Docker's interactive mode, allowing stdio communication between the client and server process.
Scale to parallel runs with cloud infrastructure using [Streamable HTTP transport](https://modelcontextprotocol.io/specification/basic/transports#streamable-http):
```python
mcp_config = {
    "hud": {
        "url": "https://mcp.hud.so/v3/mcp",
        "headers": {
            "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
            "Mcp-Image": "hudpython/hud-text-2048:v1.2"
        }
    }
}
```
* **Transport**: [Streamable HTTP](https://modelcontextprotocol.io/specification/basic/transports#streamable-http) with SSE
* **Pros**: Infinite parallelization, no local setup, managed infrastructure
* **Use case**: Benchmarks, CI/CD, multi-agent experiments
HUD's orchestrator manages Docker containers in the cloud, exposing them via HTTP endpoints with Server-Sent Events for streaming.
Both approaches use the exact same Docker image, ensuring identical behavior whether running locally or remotely
## Next Steps
* See how HUD builds on MCP for agent evaluation
# Task System
Source: https://docs.hud.so/core-concepts/task-system
How tasks define agent objectives and evaluation criteria
Tasks define what agents should do and how to measure success. They combine a prompt, environment configuration, and optional setup/evaluation phases.
## Lifecycle
```mermaid
graph LR
Task["📋 Task
(Prompt + Config)"]
Setup["🔧 Setup Phase
(Initialize State)"]
Execute["🤖 Execute Phase
(Agent Works)"]
Evaluate["📊 Evaluate Phase
(Score Result)"]
Task --> Setup
Setup --> Execute
Execute --> Evaluate
style Task fill:#e0e7ff,stroke:#6366f1,stroke-width:2px
style Execute fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style Evaluate fill:#d1fae5,stroke:#10b981,stroke-width:2px
```
## Task Structure
```python
from hud.datasets import Task
import uuid

task = Task(
    # Required fields
    prompt="Navigate to the login page and sign in as testuser@example.com",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },

    # Optional fields
    id=str(uuid.uuid4()),  # Required for HuggingFace datasets
    system_prompt="You are an expert web automation agent. Always verify page loads before interacting with elements.",
    setup_tool={
        "name": "playwright",
        "arguments": {
            "action": "navigate",
            "url": "https://example.com"
        }
    },
    evaluate_tool={
        "name": "evaluate",
        "arguments": {
            "name": "url_contains",
            "substring": "/dashboard"
        }
    },
    metadata={"category": "authentication", "difficulty": "easy"}
)
```
## Field Reference
* **`prompt`** (required): The instruction given to the agent. Be specific and clear about success criteria.
* **`mcp_config`** (required): Environment connection configuration. Supports environment variables with `${VAR}` syntax.
* **`id`**: Unique identifier. Required when creating HuggingFace datasets. Use `str(uuid.uuid4())`.
* **`system_prompt`**: Custom system prompt for the agent. Overrides the agent's default system prompt.
* **`setup_tool`**: Tool(s) to initialize environment state, with `name` and `arguments`.
* **`evaluate_tool`**: Tool(s) to score performance, with `name` and `arguments`.
* **`metadata`**: Additional task information for analysis and filtering.
## Environment Variables
Tasks automatically resolve `${VAR}` patterns when deserialized from dictionaries. This is why HuggingFace datasets should store raw dictionaries, not Task objects.
```python
# In dataset JSON:
{
    "prompt": "Complete the TODO list",
    "mcp_config": {
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "${BROWSER_IMAGE}"
            }
        }
    }
}

# When loaded:
task = Task(**task_dict)  # Variables resolved here!
# Now task.mcp_config["hud"]["headers"]["Authorization"] = "Bearer sk-hud-..."
```
This enables:
* **Public datasets** without exposing secrets
* **Environment-specific** configurations
* **CI/CD pipelines** with different credentials
## Running Tasks
```python
# Agent automatically handles all phases
result = await agent.run(task)
print(f"Success: {result.reward}") # 0.0 to 1.0
```
The agent will:
1. Execute `setup_tool` if provided
2. Work on the `prompt` using available tools
3. Execute `evaluate_tool` to calculate reward
## Working with Datasets
Tasks integrate with HuggingFace datasets:
```python
from datasets import load_dataset
from hud.datasets import run_dataset

# Load dataset from HuggingFace
dataset = load_dataset("hud-evals/sheetbench-50", split="train")

# Run agent on entire dataset
results = await run_dataset(
    "SheetBench Run",
    dataset,  # Pass the dataset object or name (e.g. "hud-evals/sheetbench-50")
    agent_class=ClaudeAgent,
    max_concurrent=5
)
```
## Creating Datasets
```python
from hud.datasets import save_tasks

# Create task dictionaries (NOT Task objects!)
task_dicts = [
    {
        "prompt": "Navigate to the login page",
        "mcp_config": {
            "hud": {
                "url": "${MCP_URL}",
                "headers": {"Authorization": "Bearer ${HUD_API_KEY}"}
            }
        },
        "setup_tool": {"name": "playwright", "arguments": {"action": "navigate", "url": "https://example.com"}},
        "evaluate_tool": {"name": "url_match", "arguments": {"pattern": ".*/login"}}
    },
    # More task dicts...
]

# Save to HuggingFace (preserves ${VAR} templates)
save_tasks(task_dicts, "my-org/my-benchmark")
```
Always save dictionaries, not Task objects. Task objects have already resolved environment variables!
## Best Practices
1. **Use UUIDs**: Always include `id=str(uuid.uuid4())` for dataset tasks
2. **Clear Prompts**: Be specific about success criteria
3. **Template Variables**: Use `${VAR}` syntax for shareable configs
4. **Rich Tools**: Include both `name` and `arguments` in tool definitions
## Next Steps
* Use tasks with standard benchmarks
* Create your own evaluation scenarios
# Create Agents
Source: https://docs.hud.so/evaluate-agents/create-agents
Build your own MCP-compatible agent for HUD evaluation
Create custom agents that work with HUD's evaluation system. You can either extend the `MCPAgent` class for full control, or use the quickstart template for a simpler approach.
## Quick Start with Template
The fastest way to create an agent is using the HUD quickstart template:
```bash
# Clone the quickstart repository
uvx hud-python quickstart
# This creates a full agent implementation with:
# - Complete lifecycle example
# - Direct tool call handling
# - Ready-to-run code
```
The quickstart provides a working agent that you can modify to add your own logic for making tool calls.
## Building from Scratch
For full control, create an agent by inheriting from `MCPAgent` and implementing four core methods:
```python
from hud.agents import MCPAgent
from hud.types import AgentResponse, MCPToolCall, MCPToolResult

class MyAgent(MCPAgent):
    """Your custom agent implementation."""

    async def get_system_messages(self) -> list[Any]:
        """Return system messages for your LLM."""
        pass

    async def get_response(self, messages: list[Any]) -> AgentResponse:
        """Generate agent response with tool calls."""
        pass

    async def format_blocks(self, blocks: list[ContentBlock]) -> list[Any]:
        """Format content blocks for your LLM."""
        pass

    async def format_tool_results(
        self, tool_calls: list[MCPToolCall],
        tool_results: list[MCPToolResult]
    ) -> list[Any]:
        """Format tool results back into messages."""
        pass
```
## Understanding the Interface
The methods provide a bridge between two worlds:
* **`messages: list[Any]`** - Your LLM's native message format (e.g., OpenAI's format)
* **`blocks: list[ContentBlock]`** - MCP's content blocks (text, images, etc.)
* **`tool_calls: list[MCPToolCall]`** and **`tool_results: list[MCPToolResult]`** - Native MCP types
These methods translate between your LLM's format and MCP's standardized format.
## Implementation Guide
### 1. Get System Messages
Define the system prompt for your agent:
```python
async def get_system_messages(self) -> list[Any]:
    """
    Return system messages in your LLM's format.
    """
    # Use the configured system prompt or default
    prompt = self.system_prompt or "You are a helpful assistant."

    # Return in your LLM's expected format
    return [{"role": "system", "content": prompt}]
```
### 2. Get Response Method
This is where your agent decides what to do:
```python
async def get_response(self, messages: list[Any]) -> AgentResponse:
    """
    Call your LLM and return tool calls.

    Args:
        messages: Conversation history in your LLM's format

    Returns:
        AgentResponse with content and tool_calls
    """
    # Call your LLM API
    response = await self.llm_client.chat(
        messages=messages,
        tools=self._available_tools  # Provided by MCPAgent
    )

    # Parse response into tool calls
    tool_calls = []
    if response.tool_calls:
        for tc in response.tool_calls:
            tool_calls.append(
                MCPToolCall(
                    name=tc.function.name,
                    arguments=json.loads(tc.function.arguments),
                    id=tc.id  # Optional tool call ID
                )
            )

    return AgentResponse(
        content=response.content,
        tool_calls=tool_calls
    )
```
### 3. Format Blocks Method
Convert MCP's content blocks into your LLM's message format:
```python
async def format_blocks(self, blocks: list[ContentBlock]) -> list[Any]:
    """
    Format content blocks from tools into messages.

    Args:
        blocks: List of MCP ContentBlock objects

    Returns:
        Messages in your LLM's expected format
    """
    # Example for OpenAI-style format
    content = []
    for block in blocks:
        if block.type == "text":
            content.append({"type": "text", "text": block.text})
        elif block.type == "image":
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:{block.mime_type};base64,{block.data}"}
            })

    return [{"role": "user", "content": content}]
```
### 4. Format Tool Results Method
Convert tool execution results back into messages:
```python
async def format_tool_results(
    self,
    tool_calls: list[MCPToolCall],
    tool_results: list[MCPToolResult]
) -> list[Any]:
    """
    Format tool results for the next LLM call.

    Args:
        tool_calls: The MCP tool calls that were made
        tool_results: MCP results from executing those tools

    Returns:
        Messages to append to conversation
    """
    messages = []

    for call, result in zip(tool_calls, tool_results):
        # Add the assistant's tool call
        messages.append({
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": call.id or str(uuid.uuid4()),
                "type": "function",
                "function": {
                    "name": call.name,
                    "arguments": json.dumps(call.arguments)
                }
            }]
        })

        # Add the tool result
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result.content)
        })

    return messages
```
## Custom Agent Implementation
To create a custom agent, implement the `MCPAgent` protocol by defining these four core methods:
```python
from hud.agents import MCPAgent
from hud.types import AgentResponse, MCPToolCall, MCPToolResult
from typing import Any
import json

class CustomAgent(MCPAgent):
    """Custom agent implementation for any LLM provider."""

    def __init__(self, model: str = "your-model", **kwargs):
        super().__init__(**kwargs)
        self.model = model
        # Initialize your LLM client here
        # self.client = YourLLMClient()

    async def get_system_messages(self) -> list[Any]:
        """Return system messages for your LLM.

        This method should return the system prompt and any other
        initialization messages in your LLM's expected format.
        """
        prompt = self.system_prompt or "You are a helpful assistant that uses tools to complete tasks."
        return [{"role": "system", "content": prompt}]

    async def get_response(self, messages: list[Any]) -> AgentResponse:
        """Generate agent response with tool calls.

        This is where you call your LLM and parse the response into
        tool calls. Convert available MCP tools to your LLM's format.
        """
        # Convert MCP tools to your LLM's tool format
        tools = [
            {
                "name": tool.name,
                "description": tool.description,
                "parameters": tool.inputSchema
            }
            for tool in self._available_tools
        ]

        # Call your LLM with tools
        response = await self.call_llm(messages, tools)

        # Parse tool calls from response
        tool_calls = self.parse_tool_calls(response)

        return AgentResponse(
            content=response.get("content", ""),
            tool_calls=tool_calls
        )

    async def format_blocks(self, blocks: list[Any]) -> list[Any]:
        """Format content blocks for your LLM.

        Convert MCP content blocks (text, images) into your LLM's
        message format.
        """
        messages = []
        for block in blocks:
            if block.type == "text":
                messages.append({"role": "user", "content": block.text})
            elif block.type == "image":
                # Handle image blocks if your LLM supports them
                messages.append({
                    "role": "user",
                    "content": [
                        {"type": "image", "image": block.data}
                    ]
                })
        return messages

    async def format_tool_results(
        self,
        tool_calls: list[MCPToolCall],
        tool_results: list[MCPToolResult]
    ) -> list[Any]:
        """Format tool results for your LLM.

        Convert tool call results back into your LLM's message format.
        """
        messages = []

        # Add assistant message with tool calls
        messages.append({
            "role": "assistant",
            "tool_calls": [
                {
                    "id": call.id,
                    "name": call.name,
                    "arguments": call.arguments
                }
                for call in tool_calls
            ]
        })

        # Add tool results
        for call, result in zip(tool_calls, tool_results):
            content = json.dumps(result.content) if result.content else ""
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": content
            })

        return messages

    # Helper methods you'll need to implement
    async def call_llm(self, messages: list[Any], tools: list[Any]) -> dict:
        """Call your LLM API and return the response."""
        # Implement your LLM API call here
        pass

    def parse_tool_calls(self, response: dict) -> list[MCPToolCall]:
        """Parse tool calls from your LLM's response format."""
        # Implement parsing logic for your LLM's tool call format
        pass
```
## Testing Your Agent
Test your agent on a simple task:
```python
import asyncio
import hud
import os
from hud.datasets import Task

async def test_agent():
    with hud.trace("test-custom-agent"):
        task = Task(
            prompt="Navigate to example.com",
            mcp_config={
                "hud": {
                    "url": "https://mcp.hud.so/v3/mcp",
                    "headers": {
                        "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                        "Mcp-Image": "hudpython/hud-remote-browser:latest"
                    }
                }
            },
            setup_tool={
                "name": "setup",
                "arguments": {
                    "name": "navigate",
                    "arguments": {"url": "https://example.com"}
                }
            },
            evaluate_tool={
                "name": "evaluate",
                "arguments": {
                    "name": "url_match",
                    "arguments": {"pattern": "example.com"}
                }
            }
        )

        # Use your custom agent
        agent = CustomAgent(model="your-model")
        result = await agent.run(task)

        print(f"Reward: {result.reward}")

asyncio.run(test_agent())
```
## Built-in Agents
For quick prototyping, HUD provides built-in agents:
```python
from hud.agents import ClaudeAgent, OperatorAgent

# Claude (Anthropic)
claude_agent = ClaudeAgent(
    allowed_tools=["anthropic_computer"]
)

# OpenAI
openai_agent = OperatorAgent(
    allowed_tools=["openai_computer"]
)
```
These are great for getting started, but implementing your own agent gives you full control over the conversation flow, tool selection, and response formatting.
## Advanced Features
### Custom System Prompts
```python
class MyAgent(MCPAgent):
    def __init__(self, **kwargs):
        super().__init__(
            system_prompt="You are an expert web automation agent.",
            **kwargs
        )
```
### Tool Filtering
```python
# Only allow specific tools
agent = MyAgent(
    allowed_tools=["click", "type", "playwright"],
    disallowed_tools=["execute_script"]
)
```
### Response Agent
The ResponseAgent is an OpenAI-powered helper that determines whether an agent should stop or continue based on the agent's messages. It's useful for handling ambiguous situations:
```python
from hud.agents.misc import ResponseAgent

# ResponseAgent analyzes agent messages like:
# "I've completed the form. Should I submit it?"
# And returns "STOP" or "CONTINUE"

agent = MyAgent(
    response_agent=ResponseAgent()  # Requires OPENAI_API_KEY
)
```
## Next Steps
* Track and compare agent performance
* Build custom evaluation datasets
# Create Benchmarks
Source: https://docs.hud.so/evaluate-agents/create-benchmarks
Build evaluation datasets using existing environments or custom setups
Create benchmarks to evaluate agent capabilities systematically. HUD makes it easy to build evaluation datasets using existing environments or by creating entirely custom setups.
## Using Existing Environments
The fastest way to create benchmarks is to leverage existing HUD environments and their evaluators.
### 1. Explore Available Evaluators
Use `hud analyze` to discover what's available:
```bash
$ hud analyze hudpython/hud-remote-browser:latest
📋 Evaluators:
├── url_match(pattern: str) - Check if URL matches pattern
├── page_contains(text: str, regex: bool = False) - Check page content
├── element_exists(selector: str) - Check if CSS selector exists
├── cookie_exists(name: str) - Check if cookie exists
├── sheet_contains(cell: str, value: str) - Check spreadsheet cell
├── sheets_cell_values(args: dict) - Verify multiple cell values
└── ... more evaluators
📝 Setup Functions:
├── navigate(url: str) - Navigate to URL
├── cookies(action: str, ...) - Manage cookies
├── sheets(action: str, ...) - Spreadsheet operations
└── ... more setup functions
```
### 2. Create Tasks with Known Evaluators
Build tasks using the discovered evaluators:
```python
import uuid
from hud.datasets import Task
# Example: Web navigation benchmark
web_tasks = []
# Task 1: Simple navigation
web_tasks.append({
"id": str(uuid.uuid4()),
"prompt": "Navigate to the documentation page",
"mcp_config": {
"hud": {
"url": "https://mcp.hud.so/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-remote-browser:latest"
}
}
},
"setup_tool": {
"name": "setup",
"arguments": {
"name": "navigate",
"arguments": {"url": "https://example.com"}
}
},
"evaluate_tool": {
"name": "evaluate",
"arguments": {
"name": "url_match",
"arguments": {
"pattern": ".*/docs.*"
}
}
},
"metadata": {"difficulty": "easy", "category": "navigation"}
})
# Task 2: Form interaction
web_tasks.append({
"id": str(uuid.uuid4()),
"prompt": "Submit the contact form with the name 'Test User' and email 'test@example.com'",
"setup_tool": {
"name": "setup",
"arguments": {
"name": "navigate",
"arguments": {"url": "https://example.com/contact"}
}
},
"evaluate_tool": {
"name": "evaluate",
"arguments": {
"name": "page_contains",
"arguments": {
"text": "Thank you for your submission"
}
}
},
"metadata": {"difficulty": "medium", "category": "forms"}
})
```
### 3. Test Your Benchmark
Test your benchmark with `run_dataset`:
```python
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent
# Test your tasks
results = await run_dataset(
"Web Navigation Test",
web_tasks, # Your list of task dicts
agent_class=ClaudeAgent,
max_concurrent=2
)
# Check results
success_rate = sum(r.reward > 0.5 for r in results) / len(results)
print(f"Success rate: {success_rate:.2%}")
```
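If some tasks fail, it helps to line each result up with the task that produced it. A minimal sketch, assuming `run_dataset` returns one result per task in the same order as the input list:
```python
# Inspect failing tasks (assumes results align with the input task order)
for task_dict, result in zip(web_tasks, results):
    if result.reward <= 0.5:
        print(f"FAILED [{task_dict['metadata']['category']}]: {task_dict['prompt']}")
        print(f"  info: {result.info}")
```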
## Publishing Benchmarks
All benchmarks are stored as HuggingFace datasets.
### Save Your Dataset
```python
from hud.datasets import save_tasks
# Save to HuggingFace (requires HF token)
save_tasks(
web_tasks, # List of task dictionaries
repo_id="my-org/web-navigation-benchmark",
private=False, # Make it public
tags=["web", "navigation", "automation"]
)
```
### Manual Upload
You can also create datasets manually:
1. Save tasks as JSON:
```python
import json

with open("benchmark.json", "w") as f:
    json.dump(web_tasks, f, indent=2)
```
2. Upload to HuggingFace:
* Go to [https://huggingface.co/new-dataset](https://huggingface.co/new-dataset)
* Upload your JSON file
* Add a README with benchmark description
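If you prefer to script the upload instead of using the web UI, the `huggingface_hub` client can create the dataset repo and push the JSON file. A minimal sketch (assumes you are logged in via `huggingface-cli login` or have `HF_TOKEN` set):
```python
from huggingface_hub import HfApi

api = HfApi()
repo_id = "my-org/web-navigation-benchmark"

# Create the dataset repo if it doesn't exist yet
api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)

# Upload the task file saved above
api.upload_file(
    path_or_fileobj="benchmark.json",
    path_in_repo="benchmark.json",
    repo_id=repo_id,
    repo_type="dataset",
)
```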
### Leaderboard Creation
Leaderboards are automatically created when the first scorecard is published for your dataset. Simply run an evaluation and create a scorecard at `app.hud.so/leaderboards/{your-dataset-id}`.
## Extending Existing Environments
### Contribute New Evaluators
For deeper customization, explore existing evaluators and contribute new ones:
```python
# Example: Contributing a new evaluator
# In environments/remote_browser/src/hud_controller/evaluate/form_complete.py
from hud.tools.types import EvaluationResult
from . import evaluate

@evaluate.tool("form_complete")
async def form_complete(ctx, required_fields: list[str]):
    """Check if all required form fields are filled."""
    # Access environment from the hub
    env = evaluate.env  # BrowserEnvironmentContext or similar
    # Use environment methods
    page_content = await env.get_page_content()
    # Check each field (simplified example)
    filled_count = 0
    for field in required_fields:
        if f'value="{field}"' in page_content or f'>{field}<' in page_content:
            filled_count += 1
    success = filled_count == len(required_fields)
    return EvaluationResult(
        reward=1.0 if success else filled_count / len(required_fields),
        done=success,
        info={"filled": filled_count, "required": len(required_fields)}
    )
```
Submit evaluators via pull requests to expand environment capabilities.
## Creating Custom Environments
For complete control over evaluation logic, build your own environment.
Complete guide to creating Docker-based MCP environments with custom evaluators
## Example: SheetBench-50
Here's an actual task from SheetBench-50:
```python
# Example task from the dataset
sheetbench_task = {
"id": "6e4744c7-b2c9-4bb6-807e-2cc144a4e8c2",
"prompt": "Calculate from the RawData tab the z-scores from the mean close price for each row. Return, starting in ANSWER!A1 and descending to ANSWER!A5, the 5 dates with the greatest absolute value of standard deviations from the mean",
"mcp_config": {
"hud": {
"url": "https://mcp.hud.so/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Run-Id": "${RUN_ID}",
"Mcp-Image": "hudpython/hud-remote-browser:v1.1"
}
}
},
"setup_tool": {
"name": "setup",
"arguments": {
"name": "sheets_from_xlsx",
"arguments": {
"file_url": "https://gahludmjcsmszgyufydt.supabase.co//storage/v1/object/public/sheetbench/c6ddeb9a-0c16-4f5e-8a06-f148ebb4be8a/setup_input_2.xlsx?"
}
}
},
"evaluate_tool": {
"name": "evaluate",
"arguments": {
"name": "sheets_cell_values",
"arguments": {
"args": {
"A1": "1/12/2024",
"A2": "1/10/2024",
"A3": "1/15/2024",
"A4": "1/11/2024",
"A5": "1/17/2024"
}
}
}
},
"metadata": {
"partial": True,
"gold_file_url": "https://gahludmjcsmszgyufydt.supabase.co//storage/v1/object/public/sheetbench/c6ddeb9a-0c16-4f5e-8a06-f148ebb4be8a/gold_solution_2.xlsx?"
}
}
# The dataset includes 50 such tasks covering:
# - Formula creation (VLOOKUP, SUMIF, etc.)
# - Data analysis (z-scores, correlations)
# - Data manipulation (sorting, filtering)
# - Chart creation
# - Pivot tables
```
## Best Practices
1. **Clear Instructions**: Write unambiguous prompts
2. **Isolated State**: Each task should be independent
3. **Measurable Success**: Use binary or graduated scoring
4. **Diverse Coverage**: Test different aspects of capability
5. **Difficulty Progression**: Include easy to hard tasks
6. **Metadata Tags**: Help users filter relevant tasks (see the sketch below)
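For the metadata point, a minimal sketch of slicing a task list by its tags before running it (plain dictionary filtering, no HUD-specific API assumed):
```python
# Filter tasks by metadata before passing them to run_dataset
easy_tasks = [t for t in web_tasks if t["metadata"]["difficulty"] == "easy"]
navigation_tasks = [t for t in web_tasks if t["metadata"]["category"] == "navigation"]
print(f"{len(easy_tasks)} easy tasks, {len(navigation_tasks)} navigation tasks")
```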
## Next Steps
Track and compare agent performance on your benchmarks
Create custom MCP environments for unique evaluation scenarios
Use your datasets to train better agents with RL
# Evaluate Agents
Source: https://docs.hud.so/evaluate-agents/index
Test and benchmark your agents on standardized tasks
HUD makes it easy to evaluate any MCP-compatible agent on a variety of tasks and benchmarks. Whether you're testing Claude, Operator, or your custom agent, the evaluation flow is consistent.
## Quick Start
Evaluate an agent on a single task:
```python
import hud
import os
from hud.datasets import Task
from hud.agents import ClaudeAgent

with hud.trace("eval-demo"):
    task = Task(
        prompt="Create a spreadsheet with quarterly sales data",
        mcp_config={
            "hud": {
                "url": "https://mcp.hud.so/v3/mcp",
                "headers": {
                    "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                    "Mcp-Image": "hudpython/hud-browser:latest"
                }
            }
        },
        setup_tool={
            "name": "playwright",
            "arguments": {
                "action": "navigate",
                "url": "https://sheets.google.com"
            }
        },
        evaluate_tool={
            "name": "evaluate",
            "arguments": {
                "name": "sheet_contains",
                "text": "Q1 Sales"
            }
        }
    )
    agent = ClaudeAgent()
    result = await agent.run(task)
    print(f"Success: {result.reward > 0.5}")
```
## What You Can Do
### 1. Create Tasks for Browser Automation
```python
import os
task = Task(
prompt="Navigate to GitHub and star the HUD repository",
mcp_config={
"hud": {
"url": "https://mcp.hud.so/v3/mcp",
"headers": {
"Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
"Mcp-Image": "hudpython/hud-browser:latest"
}
}
},
setup_tool={
"name": "setup",
"arguments": {
"name": "navigate",
"url": "https://github.com"
}
},
evaluate_tool={
"name": "evaluate",
"arguments": {
"name": "url_contains",
"substring": "hud-python"
}
}
)
```
### 2. Run Existing Benchmarks
```python
from datasets import load_dataset
from hud.datasets import run_dataset
# Load and run SheetBench-50
dataset = load_dataset("hud-evals/sheetbench-50", split="train")
results = await run_dataset(
"My SheetBench Run",
dataset,
agent_class=ClaudeAgent,
max_concurrent=5
)
```
### 3. Make Your Agent Work with HUD
To create a custom agent, inherit from `MCPAgent` and implement the required methods:
```python
from typing import Any

from hud.agents import MCPAgent
from hud.types import AgentResponse, MCPToolCall

class MyCustomAgent(MCPAgent):
    async def get_response(self, messages: list[Any]) -> AgentResponse:
        # Call your LLM and return tool calls
        ...

    async def format_blocks(self, blocks: list[Any]) -> list[Any]:
        # Format content blocks into messages for your LLM
        ...

    async def format_tool_results(
        self, tool_calls: list[MCPToolCall],
        tool_results: list[Any]
    ) -> list[Any]:
        # Format tool results back into messages
        ...

# Now it works with any HUD dataset!
```
See complete examples and implementation details
## Available Benchmarks
* **SheetBench-50** - 50 real-world spreadsheet tasks testing data manipulation, formulas, and analysis
* **OSWorld-Verified** - Desktop automation tasks across Ubuntu applications
* **2048** - Strategic planning in the 2048 game environment
* Web navigation and interaction challenges (Coming Soon)
## Exploring Environments
Use `hud analyze` to discover available tools and evaluators:
```bash
$ hud analyze hudpython/hud-browser:latest
🔍 Analyzing hudpython/hud-browser:latest...
📊 Environment Summary:
├── Tools: 15 available
├── Setup Functions: 8 available
├── Evaluators: 12 available
└── Resources: 3 available
🛠️ Tools:
├── playwright(action: str, **kwargs) - Browser automation actions
├── click(selector: str) - Click element
├── type(selector: str, text: str) - Type text
└── ... 12 more tools
📋 Evaluators:
├── url_contains(substring: str) - Check if URL contains text
├── page_contains(text: str, regex: bool = False) - Check page content
├── element_exists(selector: str) - Check if CSS selector exists
├── todo_completed(expected_count: int) - Verify TODO completion
└── ... 8 more evaluators
Run with --json for full details or pipe to grep for filtering.
```
## Publishing to Leaderboards
After running evaluations, view results on the leaderboard:
```python
# Run evaluation
results = await run_dataset(
"Claude-3.5 SheetBench",
dataset="hud-evals/sheetbench-50",
agent_class=ClaudeAgent
)
# Then visit: app.hud.so/leaderboards/hud-evals/sheetbench-50
# Click "My Jobs" to see your runs and create scorecards
```
## Key Features
* **Reproducible**: Docker environments ensure consistency
* **Parallel**: Run multiple evaluations concurrently
* **Observable**: Every tool call tracked with telemetry
* **Extensible**: Easy to add new tasks or benchmarks
## Next Steps
Build your own MCP-compatible agent
Track and compare agent performance
Build custom evaluation datasets
# Leaderboards
Source: https://docs.hud.so/evaluate-agents/leaderboards
Track and compare agent performance on standardized benchmarks
HUD automatically tracks all benchmark runs and provides leaderboards for comparing agent performance across teams and models.
## Running Benchmarks
Execute evaluations on standard datasets:
```python
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent
# Run a benchmark
results = await run_dataset(
name="Claude-3.5 SheetBench",
dataset="hud-evals/sheetbench-50",
agent_class=ClaudeAgent,
agent_config={"model": "claude-3-5-sonnet-20241022"},
max_concurrent=5
)
```
## Viewing Results
### On the Leaderboard Page
After running, view your results:
1. Navigate to `app.hud.so/leaderboards/{dataset-name}`
* Example: `app.hud.so/leaderboards/hud-evals/sheetbench-50`
2. Click "My Jobs" to see your evaluation runs
3. Select runs to analyze performance
### Local Analysis
Calculate metrics locally:
```python
# Basic metrics
success_rate = sum(r.reward > 0.5 for r in results) / len(results)
avg_reward = sum(r.reward for r in results) / len(results)
avg_duration = sum(r.duration for r in results) / len(results)
print(f"Success Rate: {success_rate:.2%}")
print(f"Average Reward: {avg_reward:.3f}")
print(f"Average Duration: {avg_duration:.1f}s")
```
## Creating Scorecards
Transform your evaluation run into a public scorecard:
1. Execute your agent on the full dataset
2. Go to `app.hud.so/leaderboards/{dataset-name}`
3. Click "My Jobs" and choose your evaluation run
4. Select your best runs (we compute statistics over multiple jobs) and create a scorecard
5. Your scorecard appears on the public leaderboard
## Available Benchmarks
* **SheetBench-50** - Spreadsheet manipulation and analysis tasks
* **OSWorld-Verified** - Desktop automation in Ubuntu environment
* **2048** - Strategic planning and game playing
* Web navigation challenges (Coming Soon)
## Best Practices
1. **Agent Naming**: Ensure your agent class has a meaningful name - it will be displayed on the leaderboard
2. **Multiple Runs**: Select your best few runs when creating a scorecard (we compute statistics over them)
3. **Consistent Configuration**: Use the same model version and parameters for fair comparison
4. **Share Results**: Make your scorecards public to contribute to the community
Leaderboards are automatically created when the first scorecard is published for a dataset
## Next Steps
Build your own MCP-compatible agent
Build custom evaluation datasets
# Introduction
Source: https://docs.hud.so/index
OSS RL environment + evals toolkit.
Test Claude, Operator, or custom agents on benchmarks like SheetBench and OSWorld
Wrap any software in dockerized MCP for scalable and generalizable agent evaluation
Use reinforcement learning and GRPO on evaluations to improve agent performance
## What is HUD?
HUD connects AI agents to software environments using the Model Context Protocol (MCP). Whether you're evaluating existing agents, building new environments, or training models with RL, HUD provides the infrastructure.
```mermaid
graph LR
    Agent["🤖 Any Agent (Claude, Operator, etc.)"]
    MCP["🔌 MCP Protocol (Tool Calls)"]
    Env["📦 Any Environment (Browser, OS, etc.)"]
    Agent -->|"call_tool()"| MCP
    MCP -->|"click(x, y)"| Env
    Env -->|"screenshot"| MCP
    MCP -->|"get_response()"| Agent
    style Agent fill:#e0e7ff,stroke:#6366f1,stroke-width:2px
    style MCP fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
    style Env fill:#d1fae5,stroke:#10b981,stroke-width:2px
```
## Why HUD?
* **🔌 MCP-native**: Any agent can connect to any environment
* **📡 Live telemetry**: Debug every tool call at [app.hud.so](https://app.hud.so)
* **🚀 Production-ready**: From local Docker to cloud scale
* **🎯 Built-in benchmarks**: OSWorld-Verified, SheetBench-50, and more
* **🔧 CLI tools**: Debug, analyze, create environments with `hud debug` and `hud analyze`
Run your first agent evaluation with zero setup
```bash
uvx hud-python quickstart
```
## Quick Example
```python
import asyncio, os, hud
from hud.datasets import Task
from hud.agents import ClaudeAgent

async def main():
    # Define evaluation task with remote MCP
    task = Task(
        prompt="Win a game of 2048 by reaching the 128 tile",
        mcp_config={
            "hud": {
                "url": "https://mcp.hud.so/v3/mcp",
                "headers": {
                    "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                    "Mcp-Image": "hudpython/hud-text-2048:v1.2"
                }
            }
        },
        setup_tool={"name": "setup", "arguments": {"name": "board", "arguments": {"board_size": 4}}},
        evaluate_tool={"name": "evaluate", "arguments": {"name": "max_number", "arguments": {"target": 64}}}
    )
    # Run agent (auto-creates MCP client)
    agent = ClaudeAgent()
    result = await agent.run(task)
    print(f"Score: {result.reward}")

asyncio.run(main())
```
## Community
Star the repo and contribute
Join our community
### Are you a startup building agents?
[📅 Hop on a call](https://cal.com/jay-ram-z6st6w/demo) or [📧 founders@hud.so](mailto:founders@hud.so)
# LLM Quickstart
Source: https://docs.hud.so/llm-quickstart
Add context about hud to any coding agent
## Setup
```bash
claude mcp add --transport http docs-hud https://docs.hud.so/mcp
```
Add to MCP settings:
```json
"docs-hud": {
"url": "https://docs.hud.so/mcp"
}
```
Or use one-click install:
[Install in Cursor](https://cursor.com/en/install-mcp?name=docs-hud-so\&config=eyJ1cmwiOiJodHRwczovL2RvY3MuaHVkLnNvL21jcCJ9)
Alternatively, copy-paste the full `llms-full.txt` into your assistant's context.
## What you get
Your AI assistant gains access to:
* Complete HUD API reference and methods
* Working code examples and implementations
* Architecture patterns and best practices
* Real-time updates as documentation evolves
Try asking your assistant: "How do I create a custom agent in HUD?" or "Help me debug MCP tool calls"
## Next steps
Get started with HUD in 3 minutes
Create custom MCP environments
# Quickstart
Source: https://docs.hud.so/quickstart
Run your first agent evaluation in 3 minutes
Get up and running with HUD in minutes. This guide walks you through installation, basic setup, and running your first agent evaluation.
## Quick Clone
The fastest way to get started:
```bash
# Clone a complete example project with uv
uvx hud-python quickstart
```
This sets you up with a working agent evaluation example you can run immediately.
## Installation
```bash
pip install hud-python
```
```bash
# Includes AI providers and telemetry
pip install "hud-python[agent]"
```
```bash
# Install CLI in isolated environment
uv tool install hud-python
```
## API Keys
Set your API keys as environment variables:
```bash
export HUD_API_KEY="sk-hud-..." # Get from app.hud.so
export ANTHROPIC_API_KEY="sk-ant-..." # For Claude agents
export OPENAI_API_KEY="sk-..." # For OpenAI agents
```
Create a `.env` file in your project root to manage API keys locally
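If you keep keys in a `.env` file, one way to load them before creating agents is `python-dotenv` (an assumption here, not a HUD dependency; any loader works):
```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads .env from the current working directory
assert os.getenv("HUD_API_KEY"), "HUD_API_KEY is missing from the environment/.env"
```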
## Your First Agent
Run an agent on the 2048 game environment:
```python
import asyncio, os
import hud
from hud.datasets import Task
from hud.agents import ClaudeAgent

async def main():
    # The trace context captures ALL agent interactions within a "task run"
    # Everything inside this context shows up as one trace on app.hud.so
    with hud.trace("quickstart-2048"):
        # Define task with remote MCP environment
        task = Task(
            prompt="Win a game of 2048 by reaching the 128 tile",
            mcp_config={
                "hud": {
                    "url": "https://mcp.hud.so/v3/mcp",
                    "headers": {
                        "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                        "Mcp-Image": "hudpython/hud-text-2048:v1.2"
                    }
                }
            },
            setup_tool={"name": "setup", "arguments": {"name": "board", "arguments": {"board_size": 4}}},
            evaluate_tool={"name": "evaluate", "arguments": {"name": "max_number", "arguments": {"target": 64}}}
        )
        # Run agent (auto-creates MCP client from task.mcp_config)
        agent = ClaudeAgent()
        result = await agent.run(task)
        print(f"Max tile reached: {result.reward}")

asyncio.run(main())
```
The trace context ensures that the entire task run - from setup through evaluation - appears as one coherent trace on the platform
## What just happened?
1. **Task Definition**: We created a `Task` with:
* A prompt telling the agent what to do
* An `mcp_config` pointing to a remote MCP environment
* Setup and evaluation tools to initialize and score the game
2. **Auto Client**: The agent automatically created an MCP client from `task.mcp_config`
3. **Telemetry**: The `trace` context captured all interactions for debugging
4. **Evaluation**: The `max_number` evaluator returned the highest tile reached as the reward
## Next Steps
Learn how agents connect to environments
Test on SheetBench and OSWorld
Create your own MCP environment
## CLI Quick Reference
```bash
# Analyze any MCP environment
hud analyze hudpython/hud-text-2048:v1.2
# Debug environment initialization
hud debug hudpython/hud-text-2048:v1.2
# Start hot-reload development server
hud dev . --build
# Run HUD tools as MCP server (for Cursor/Claude)
hud mcp
```
See the [CLI Reference](/reference/cli/overview) for detailed command documentation
# Agents
Source: https://docs.hud.so/reference/agents
SDK reference for HUD agent classes
The HUD SDK provides a base `MCPAgent` class and several pre-built agent implementations for interacting with MCP environments.
## Base Class
### MCPAgent
```python
from hud.agents import MCPAgent
```
Abstract base class for all MCP-enabled agents. Handles tool discovery, filtering, and execution flow.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------------------------- | ---------------- | ------------------------------------------ | -------------- |
| `mcp_client` | `AgentMCPClient` | MCP client for server connections | `None` |
| `allowed_tools` | `list[str]` | List of tool names to allow | `None` (all) |
| `disallowed_tools` | `list[str]` | List of tool names to disallow | `[]` |
| `lifecycle_tools` | `list[str]` | Tools hidden from agent (setup/evaluate) | `[]` |
| `initial_screenshot` | `bool` | Capture screenshot before first prompt | `False` |
| `system_prompt` | `str` | System prompt to use | Default prompt |
| `append_tool_system_prompt` | `bool` | Append available tools to system | `False` |
| `append_setup_output` | `bool` | Append setup tool output to initial prompt | `False` |
| `model_name` | `str` | Model name for telemetry | `"mcp-agent"` |
| `response_agent` | `ResponseAgent` | Optional auto-response handler | `None` |
| `auto_trace` | `bool` | Enable automatic telemetry | `True` |
**Key Methods:**
```python
async def initialize(task: str | Task | None = None) -> None
"""Initialize agent with task-specific configuration."""
async def run(prompt_or_task: str | Task | dict[str, Any], max_steps: int = 10) -> Trace
"""Run agent with prompt or task. Returns Trace with results."""
async def call_tools(tool_call: MCPToolCall | list[MCPToolCall]) -> list[MCPToolResult]
"""Execute tool calls through MCP client."""
def get_available_tools() -> list[types.Tool]
"""Get filtered list of available tools (excludes lifecycle)."""
def get_tool_schemas() -> list[dict]
"""Get tool schemas formatted for the model."""
```
**Abstract Methods (must implement):**
```python
async def get_system_messages() -> list[Any]
"""Get system prompt formatted for the model."""
async def get_response(messages: list[Any]) -> AgentResponse
"""Get model response including tool calls."""
async def format_blocks(blocks: list[ContentBlock]) -> list[Any]
"""Format content blocks into model messages."""
async def format_tool_results(tool_calls: list[MCPToolCall],
tool_results: list[MCPToolResult]) -> list[Any]
"""Format tool results for the model."""
```
**Class Variables:**
* `metadata: dict[str, Any]` - Metadata injected into MCP initialize request
**Auto-Client Creation:**
If no `mcp_client` is provided but a `Task` with `mcp_config` is passed to `run()`, an MCPClient is automatically created and cleaned up.
## Pre-built Agents
### ClaudeAgent
```python
from hud.agents import ClaudeAgent
```
Claude-specific implementation using Anthropic's API.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------------- | ---------------- | ----------------------- | ------------------------------ |
| `model_client` | `AsyncAnthropic` | Anthropic client | Auto-created |
| `model` | `str` | Claude model to use | `"claude-3-7-sonnet-20250219"` |
| `max_tokens` | `int` | Maximum response tokens | `4096` |
| `use_computer_beta` | `bool` | Use computer-use beta | `True` |
**Features:**
* Native Claude tool calling
* Automatic prompt caching
* Computer-use beta support
* Display metadata injection (1280x720)
**Example:**
```python
agent = ClaudeAgent(
model="claude-3-5-sonnet-20241022",
max_tokens=8192
)
result = await agent.run(
Task(
prompt="Navigate to example.com",
mcp_config={"server": {...}},
evaluate_tool={"name": "evaluate", "arguments": {...}}
)
)
```
### OperatorAgent
```python
from hud.agents import OperatorAgent
```
OpenAI's Operator agent implementation.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| -------------- | ------------- | ----------------------- | --------------------------- |
| `model_client` | `AsyncOpenAI` | OpenAI client | Auto-created |
| `model` | `str` | Model to use | `"gpt-4o-realtime-preview"` |
| `max_tokens` | `int` | Maximum response tokens | `4096` |
**Features:**
* OpenAI function calling
* Operator system prompt
* Display metadata injection (1920x1080)
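**Example** (a minimal sketch; `allowed_tools=["openai_computer"]` mirrors the built-in agents example earlier in these docs, and the other parameters come from the table above):
```python
from hud.agents import OperatorAgent

agent = OperatorAgent(
    allowed_tools=["openai_computer"],
    max_tokens=4096,
)
# Run against a Task, as in the usage examples below (inside an async context)
result = await agent.run(task)
```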
### GenericOpenAIChatAgent
```python
from hud.agents import GenericOpenAIChatAgent
```
Generic OpenAI chat completion agent for any OpenAI-compatible API.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| -------------- | ------------- | ------------------------ | -------- |
| `model_client` | `AsyncOpenAI` | OpenAI-compatible client | Required |
| `model` | `str` | Model name | Required |
| `max_tokens` | `int` | Maximum response tokens | `4096` |
**Example:**
```python
# Use with local LLM
from openai import AsyncOpenAI
client = AsyncOpenAI(
base_url="http://localhost:11434/v1", # Ollama
api_key="not-needed"
)
agent = GenericOpenAIChatAgent(
model_client=client,
model="llama3.1"
)
```
### LangChainAgent
```python
from hud.agents import LangChainAgent
```
LangChain integration for using any LangChain-compatible model.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------- | --------------- | -------------------- | -------- |
| `model` | `BaseChatModel` | LangChain chat model | Required |
**Example:**
```python
from langchain_anthropic import ChatAnthropic
model = ChatAnthropic(model="claude-3-opus-20240229")
agent = LangChainAgent(model=model)
```
### ArtHUDAgent
```python
from hud.agents import ArtHUDAgent
```
Integration with ART (Automatic Reasoning and Tool-use) models.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------ | ---------- | ----------------------- | -------- |
| `art_model` | `ArtModel` | ART model instance | Required |
| `max_tokens` | `int` | Maximum response tokens | `4096` |
**Features:**
* Built-in Chain-of-Thought reasoning
* Structured reasoning traces
* Tool use optimization
## Helper Classes
### ResponseAgent
Base class for auto-response handlers that decide when to continue or stop.
```python
from hud.agents.misc import ResponseAgent
class MyResponseAgent(ResponseAgent):
    async def determine_response(self, agent_output: str) -> str:
        if "task complete" in agent_output.lower():
            return "STOP"
        return "Continue with the next step"
```
## Common Types
### AgentResponse
```python
from hud.types import AgentResponse
@dataclass
class AgentResponse:
    content: str | None = None
    tool_calls: list[MCPToolCall] | None = None
    done: bool = False
```
### MCPToolCall
```python
from hud.types import MCPToolCall
class MCPToolCall(BaseModel):
    name: str
    arguments: dict[str, Any] = {}
```
### MCPToolResult
```python
from hud.types import MCPToolResult
class MCPToolResult(BaseModel):
    content: list[ContentBlock]
    structuredContent: dict[str, Any] | None = None
    isError: bool = False
```
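To see how these types fit together, here is a minimal sketch of issuing a tool call through an initialized agent, using the `call_tools` signature from the Key Methods section above (the tool name is illustrative):
```python
from hud.types import MCPToolCall

# Assumes `agent` was created and initialized as in the usage examples below
call = MCPToolCall(name="move", arguments={"direction": "left"})
results = await agent.call_tools(call)  # returns list[MCPToolResult]

first = results[0]
if not first.isError:
    for block in first.content:
        print(block)
```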
### Trace
```python
from hud.types import Trace
class Trace(BaseModel):
    reward: float = 0.0
    done: bool = True
    content: str | None = None
    isError: bool = False
    info: dict[str, Any] = {}
    steps: list[TraceStep] = []
```
## Usage Examples
### Simple Prompt Execution
```python
from hud.agents import ClaudeAgent
from hud.clients import MCPClient
# Manual client creation
client = MCPClient({
"server": {
"command": "docker",
"args": ["run", "-i", "my-env:latest"]
}
})
agent = ClaudeAgent(mcp_client=client)
await agent.initialize()
# Run with string prompt
result = await agent.run("Click the submit button", max_steps=5)
print(f"Result: {result.content}")
print(f"Success: {not result.isError}")
```
### Task Execution with Auto-Client
```python
from hud.agents import OperatorAgent
from hud.datasets import Task
# No client needed - auto-created from task
agent = OperatorAgent()
task = Task(
prompt="Find the price of the product",
mcp_config={
"browser": {
"url": "mcp://mcp.hud.so/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-browser:latest"
}
}
},
setup_tool={
"name": "setup",
"arguments": {"url": "https://example.com"}
},
evaluate_tool={
"name": "evaluate",
"arguments": {"check": "price_found"}
}
)
# Client created automatically
result = await agent.run(task, max_steps=20)
print(f"Reward: {result.reward}")
```
### Custom Agent Implementation
```python
from hud.agents import MCPAgent
from hud.types import AgentResponse, MCPToolCall
from mcp.types import ContentBlock
import hud

class MyCustomAgent(MCPAgent):
    metadata = {"custom": "metadata"}

    async def get_system_messages(self) -> list[dict]:
        return [{
            "role": "system",
            "content": self.system_prompt
        }]

    @hud.instrument(span_type="agent", record_args=False, record_result=True)
    async def get_response(self, messages: list[dict]) -> AgentResponse:
        # Your LLM call here
        response = await self.llm.chat(messages)
        return AgentResponse(
            content=response.content,
            tool_calls=[
                MCPToolCall(name=tc.name, arguments=tc.args)
                for tc in response.tool_calls
            ],
            done=response.stop_reason == "stop"
        )

    async def format_blocks(self, blocks: list[ContentBlock]) -> list[dict]:
        content = []
        for block in blocks:
            if block.type == "text":
                content.append({"type": "text", "text": block.text})
            elif block.type == "image":
                content.append({
                    "type": "image",
                    "image": {"data": block.data, "format": "png"}
                })
        return [{"role": "user", "content": content}]

    async def format_tool_results(self, tool_calls, tool_results) -> list[dict]:
        return [{
            "role": "tool",
            "content": result.content,
            "tool_call_id": call.name
        } for call, result in zip(tool_calls, tool_results)]
```
## See Also
* [Create Agents](/evaluate-agents/create-agents) - Tutorial on building agents
* [Tasks](/reference/tasks) - Task configuration reference
* [Architecture](/core-concepts/architecture) - How agents fit in HUD
# hud analyze
Source: https://docs.hud.so/reference/cli/analyze
Inspect MCP environments quickly from metadata or live containers
The `hud analyze` command inspects MCP environments to discover their tools and capabilities. By default, it uses cached metadata for instant results. Use `--live` for real-time analysis.
## Usage
```bash
hud analyze [OPTIONS]
```
## Arguments
Docker image, lock file path, or command to analyze
## Options
Output format: `interactive`, `json`, or `markdown`. Short: `-f`
Show full tool schemas and parameters. Short: `-v`
Run container for live analysis (slower but more accurate)
JSON config file with MCP configuration. Short: `-c`
Analyze a server from Cursor config
Analyze a local command instead of Docker image
Timeout in seconds for live analysis
## Analysis Modes
### Fast Mode (Default)
Uses cached metadata from:
1. Local lock file cache (`~/.hud/locks/`)
2. HUD registry (if available)
3. Basic Docker manifest info
```bash
# Instant results from metadata
hud analyze hudpython/text-analyzer:latest
# From lock file
hud analyze ./my-env.lock.yaml
```
Fast mode is perfect for quick inspection and doesn't require running containers.
### Live Mode
Runs the actual container for comprehensive analysis:
```bash
# Full analysis with running container
hud analyze hudpython/text-analyzer:latest --live
# With environment variables
hud analyze my-env:latest --live -e API_KEY=test
```
Use `--live` when you need:
* Real-time tool validation
* Resource discovery
* Telemetry information
* Testing with specific env vars
## Output Formats
### Interactive (Default)
Fast mode output:
```
📊 Environment Overview
┌─────────────┬─────────────────────────┐
│ Image │ hudpython/text-2048 │
│ Source │ HUD Registry │
│ Built │ 2024-01-15T10:30:00Z │
│ HUD Version │ 0.1.0 │
│ Init Time │ 450 ms │
│ Tools │ 6 │
└─────────────┴─────────────────────────┘
🔧 Available Tools
└── Tools
├── setup - Initialize environment
├── evaluate - Return environment state
├── move - Move tiles in direction
├── reset_board - Reset game to start
└── get_score - Get current score
```
Live mode output (`--live`):
```
🔍 Analyzing MCP environment: hudpython/text-2048:latest
📊 Environment Overview
┌─────────────┬─────────────────────────┐
│ Server │ hud-text-2048 │
│ Initialized │ ✓ │
└─────────────┴─────────────────────────┘
🔧 Available Tools
├── Regular Tools
│ ├── move
│ │ └── Move tiles in the specified direction
│ ├── reset_board
│ │ └── Reset the game board to initial state
│ └── get_score
│ └── Get the current game score
└── Hub Tools
└── game_hub
├── save_state
└── load_state
📚 Available Resources
┌──────────────────┬────────────────┬─────────────┐
│ URI │ Name │ Type │
├──────────────────┼────────────────┼─────────────┤
│ game://state │ Game State │ application │
│ game://history │ Move History │ text/plain │
└──────────────────┴────────────────┴─────────────┘
📡 Telemetry Data
┌─────────┬─────────────────────────────┐
│ Live URL│ https://app.hud.so/xyz123 │
│ Status │ running │
│ Services│ 2/2 running │
└─────────┴─────────────────────────────┘
```
### JSON Format
```bash
# Fast mode
hud analyze my-env --format json
{
"image": "my-env:latest",
"status": "from_cache",
"tool_count": 6,
"init_time": 450,
"tools": [{
"name": "setup",
"description": "Initialize environment"
}]
}
# Live mode
hud analyze my-env --format json --live
{
"metadata": {
"servers": ["my-env"],
"initialized": true
},
"tools": [{
"name": "setup",
"description": "Initialize environment",
"input_schema": {...}
}],
"hub_tools": {...},
"resources": [...],
"telemetry": {...}
}
```
### Markdown Format
```bash
hud analyze my-env --format markdown > docs/tools.md
```
Generates formatted documentation with tool descriptions and schemas.
## Examples
### Fast Analysis (Default)
```bash
# Quick inspection from metadata
hud analyze hudpython/text-analyzer:latest
# Analyze multiple environments rapidly
for env in browser scraper analyzer; do
hud analyze "hudpython/$env:latest"
done
# From lock file
hud analyze ./environments/prod.lock.yaml
# With verbose schemas
hud analyze my-env:latest --verbose
```
### Live Analysis
```bash
# Full container analysis
hud analyze my-env:latest --live
# With environment variables
hud analyze my-env:latest --live -e API_KEY=test -e DEBUG=true
# Custom timeout for slow environments
hud analyze heavy-env:latest --live --timeout 60
```
### Local Development
```bash
# Analyze local command
hud analyze --command "python my_server.py" --live
# Cursor integration
hud analyze --cursor my-dev-server --live
# With config file
hud analyze my-env:latest --config mcp-config.json
```
### Documentation Generation
```bash
# Generate tool documentation
hud analyze my-env:latest --format markdown > docs/tools.md
# Compare versions
hud analyze my-env:v1.0 --format json > v1.json
hud analyze my-env:v2.0 --format json > v2.json
diff v1.json v2.json
# Extract tool names
hud analyze my-env --format json | jq -r '.tools[].name'
```
## Performance Comparison
| Mode | Speed | Use Case |
| --------------- | ------------ | --------------------------- |
| Fast (default) | \< 1 second | Quick inspection, CI checks |
| Live (`--live`) | 5-30 seconds | Full validation, debugging |
Use fast mode for rapid iteration during development. Switch to `--live` for final validation.
## CI/CD Integration
### Fast CI Checks
```yaml
# GitHub Actions - Quick tool validation
- name: Verify tools exist
  run: |
    TOOLS=$(hud analyze $IMAGE --format json | jq '.tool_count')
    if [ "$TOOLS" -lt 3 ]; then
      echo "Not enough tools!"
      exit 1
    fi
```
### Python Validation Script
```python
#!/usr/bin/env python3
import subprocess
import json

def verify_environment(image, required_tools):
    """Fast validation using metadata"""
    # Quick check with metadata
    result = subprocess.run(
        ["hud", "analyze", image, "--format", "json"],
        capture_output=True, text=True, check=True
    )
    data = json.loads(result.stdout)
    available = {tool["name"] for tool in data["tools"]}
    missing = set(required_tools) - available
    if missing:
        print(f"⚠️ Missing tools: {missing}")
        print("Running live analysis for details...")
        # Full check if needed
        subprocess.run(
            ["hud", "analyze", image, "--live"],
            check=True
        )
    return len(missing) == 0

# Usage
verify_environment("my-env:latest", ["setup", "evaluate", "process"])
```
## Troubleshooting
The environment may not have been pushed to HUD registry:
```bash
# Falls back to basic Docker info
hud analyze my-env:latest --verbose
# Force live analysis
hud analyze my-env:latest --live
```
If metadata seems outdated:
```bash
# Use live mode for current state
hud analyze my-env:latest --live
# Rebuild and push to update metadata
hud build && hud push
```
Some environments take longer to initialize:
```bash
# Increase timeout
hud analyze heavy-env:latest --live --timeout 120
# Check with debug first
hud debug heavy-env:latest --max-phase 1
```
Ensure clean output:
```bash
# Redirect stderr for clean JSON
hud analyze my-env --format json 2>/dev/null
# Or use jq for validation
hud analyze my-env --format json | jq .
```
## Best Practices
1. **Default to Fast Mode**: Start with metadata for quick checks
2. **Live for Validation**: Use `--live` before production deployments
3. **Cache Lock Files**: Share lock files for consistent metadata
4. **Version Your Tools**: Track tool changes across versions
5. **Automate Checks**: Add fast analysis to CI pipelines
Fast mode shows what tools *should* be available based on build-time analysis. Live mode shows what *actually* works right now.
## See Also
* [`hud debug`](/reference/cli/debug) - Test tool functionality
* [`hud pull`](/reference/cli/pull) - Get environments with metadata
* [`hud build`](/reference/cli/build) - Generate fresh metadata
* [`hud run`](/reference/cli/run) - Execute environments
# hud build
Source: https://docs.hud.so/reference/cli/build
Build production images and generate lock files for reproducible environments
The `hud build` command creates production Docker images and generates `hud.lock.yaml` files that capture all metadata needed for reproducible deployments.
## Usage
```bash
hud build [DIRECTORY] [OPTIONS]
```
## Arguments
`DIRECTORY` - Environment directory containing Dockerfile and pyproject.toml
## Options
* `--tag`, `-t` - Docker image tag. Auto-generates based on directory name if not specified
* `--no-cache` - Build without using Docker cache
* `--verbose`, `-v` - Show detailed build output
## What It Does
The build command performs several critical steps:
1. **Build the image**: Runs `docker build` with appropriate tags and build arguments
2. **Analyze the environment**: Connects to the built image and analyzes:
   * Initialization time
   * Available tools and descriptions
   * Required/optional environment variables
   * Protocol capabilities
3. **Generate the lock file**: Creates `hud.lock.yaml` with:
   * Image digest (sha256 hash)
   * Build metadata
   * Tool summaries
   * Environment requirements
4. **Label the image**: Adds metadata labels to the Docker image for tracking
## Lock File Structure
The generated `hud.lock.yaml` contains:
```yaml
version: "1.0"
image: "my-env:latest@sha256:abc123..."
build:
generatedAt: "2024-01-15T10:30:00Z"
hudVersion: "0.1.0"
directory: "my-env"
environment:
initializeMs: 450
toolCount: 12
variables:
required: ["API_KEY", "DATABASE_URL"]
optional: ["DEBUG", "LOG_LEVEL"]
tools:
- name: "setup"
description: "Initialize environment"
- name: "evaluate"
description: "Return environment state"
- name: "process_data"
description: "Process input data with advanced algorithms..."
```
## Examples
### Basic Build
Build with auto-generated tag:
```bash
hud build
# Creates: hud-my-env:latest
```
### Custom Tag
Specify version tag:
```bash
hud build . --tag v1.2.0
# Creates: hud-my-env:v1.2.0
```
### Clean Build
Force rebuild without cache:
```bash
hud build . --no-cache --verbose
```
### CI/CD Build
Build with specific naming:
```bash
hud build . --tag "prod-$(git rev-parse --short HEAD)"
# Creates: hud-my-env:prod-abc123f
```
## Image Naming
If no tag is specified via `--tag`, the command:
1. Checks `pyproject.toml` for `[tool.hud.image]`
2. Auto-generates: `hud-{directory-name}:latest`
3. Converts underscores to hyphens for Docker compatibility
Examples:
* Directory `my_tool` → Image `hud-my-tool:latest`
* Directory `browser-env` → Image `hud-browser-env:latest`
## Build Process
### Dockerfile Detection
The command expects a `Dockerfile` in the environment directory:
```dockerfile
FROM python:3.10-slim
RUN pip install uv
WORKDIR /app
COPY pyproject.toml .
RUN uv pip install --system -e .
COPY src/ src/
CMD ["python", "-m", "my_package.server"]
```
### Environment Variables
Automatically extracts from Dockerfile:
* `ENV` declarations → Optional variables
* `ARG` with no default → Required variables
```dockerfile
# Detected as required
ARG API_KEY
# Detected as optional
ENV DEBUG=false
ENV LOG_LEVEL=info
```
### Build Arguments
Passes through Docker build arguments:
```bash
hud build . --build-arg API_VERSION=v2
```
## Lock File Usage
The generated `hud.lock.yaml` enables:
* Exact image reference with digest ensures identical deployment
* Tool information available without running container
* Metadata for `hud push` to share environments
* Version tracking for debugging and updates
## Integration
### With hud push
After building, share your environment:
```bash
hud build . --tag v1.0.0
hud push
```
### With CI/CD
```yaml
# GitHub Actions example
- name: Build HUD environment
  run: |
    hud build . --tag "${{ github.sha }}"
- name: Verify build
  run: |
    hud debug "hud-my-env:${{ github.sha }}"
- name: Push to registry
  run: |
    hud push
```
### Version Management
```bash
# Semantic versioning
hud build . --tag v1.0.0
hud build . --tag v1.1.0-beta
# Date-based
hud build . --tag "$(date +%Y%m%d)"
# Git-based
hud build . --tag "$(git describe --tags --always)"
```
## Troubleshooting
### Build Failures
Check Dockerfile syntax and ensure all files are present:
```bash
docker build . --no-cache
```
The environment may be taking too long to initialize:
```bash
hud build . --verbose # See detailed logs
```
Ensure the environment implements required tools:
* `setup()` - Must exist
* `evaluate()` - Must return a dictionary
### Performance Tips
Structure your Dockerfile to maximize layer caching:
1. Install system dependencies first
2. Copy dependency files (pyproject.toml)
3. Install Python dependencies
4. Copy source code last
Using `--no-cache` significantly increases build time. Only use when necessary.
## Best Practices
1. **Tag Consistently**: Use semantic versioning or git-based tags
2. **Cache Wisely**: Structure Dockerfile for optimal caching
3. **Test Builds**: Always run `hud debug` after building
4. **Document Variables**: Clearly indicate required vs optional env vars
5. **Keep Images Small**: Use slim base images and multi-stage builds
## See Also
* [`hud init`](/reference/cli/init) - Create new environments
* [`hud dev`](/reference/cli/dev) - Development with hot-reload
* [`hud push`](/reference/cli/push) - Share to registry
* [`hud analyze`](/reference/cli/analyze) - Inspect built images
* [Build Environments](/build-environments) - Complete guide
# hud debug
Source: https://docs.hud.so/reference/cli/debug
Test MCP environments through 5 validation phases
The `hud debug` command validates MCP environments through 5 progressive phases.
## Synopsis
```bash
hud debug [OPTIONS] IMAGE [DOCKER_ARGS...]
hud debug --command COMMAND [ARGS...]
hud debug --cursor SERVER_NAME
```
## Options
| Option | Description | Default |
| ------------- | --------------------------------- | ------- |
| `--max-phase` | Stop at specified phase (1-5) | 5 |
| `--command` | Debug a command instead of Docker | - |
| `--cursor` | Debug a Cursor MCP server | - |
| `--timeout` | Timeout per phase in seconds | 30 |
| `-e, --env` | Set environment variables | - |
## Debug Phases
### Phase 1: Container Startup
```
🚀 Phase 1: Container startup
✓ Container started successfully
✓ Received output on stderr
```
Checks: Process starts, outputs to stderr, doesn't exit
### Phase 2: MCP Initialization
```
🔗 Phase 2: MCP initialization
✓ Received valid initialize response
✓ Protocol version: 1.0
```
Checks: JSON-RPC communication, protocol compatibility
### Phase 3: Tool Discovery
```
🔧 Phase 3: Tool discovery
✓ Found 8 tools:
• setup_board - Initialize game board
• move - Move in direction
```
Checks: Tools registered, valid schemas
### Phase 4: Tool Execution
```
🧪 Phase 4: Tool testing
→ Testing tool: setup_board
✓ Tool executed successfully
```
Checks: Tools callable, return valid responses
### Phase 5: Readiness Check
```
✅ Phase 5: Readiness check
✓ Ready for agent interaction
```
Checks: Overall health, state persistence
Once your environment passes Phase 5, consider using `hud dev` for hot-reload development. This eliminates rebuild cycles and speeds up iteration.
## Examples
### Docker Debug
```bash
# Basic
hud debug my-server:latest
# With environment variables
hud debug my-server:latest -e DEBUG=true -e LOG_LEVEL=debug
# Stop at phase 3
hud debug my-server:latest --max-phase 3
```
### Local Development
```bash
# Python script
hud debug --command "python my_server.py"
# With arguments
hud debug --command "python -m my_module.server --port 8000"
```
### Cursor Integration
```bash
# Debug Cursor server
hud debug --cursor my-dev-env
```
## Common Issues
### Phase 1 Failures
* **Container exits**: Check CMD/ENTRYPOINT
* **No stderr**: Add `print(..., file=sys.stderr)`
* **Permission denied**: Check file permissions
### Phase 2 Failures
* **Invalid JSON**: Only JSON-RPC on stdout
* **Timeout**: Check for blocking operations
### Phase 3 Failures
* **No tools**: Check `@server.tool()` decorators
* **Invalid schema**: Add type hints and docstrings
## Advanced Usage
### Incremental Debugging
```bash
# Debug phase by phase
for phase in 1 2 3 4 5; do
echo "Testing phase $phase..."
hud debug my-env:latest --max-phase $phase || break
done
```
### Development Mode
```bash
# Once debug passes, use hot-reload for development
cd environments/my-env
hud dev . --build
# Make changes to src/ files
# Test changes without rebuilding
hud analyze hud-my-env:dev
```
### Parallel Testing
```bash
# Test multiple environments
images=("env1:latest" "env2:latest")
for image in "${images[@]}"; do
hud debug "$image" > "debug_${image//:/}.log" 2>&1 &
done
wait
```
## Best Practices
1. Start with phase 1 and work up
2. Use `--max-phase` for incremental debugging
3. Check stderr output first
4. Keep tools simple for easier debugging
5. Add health check tools
If debug passes, agents should work reliably with your environment.
## Next Step
Explore environment capabilities after debugging
# hud dev
Source: https://docs.hud.so/reference/cli/dev
Hot-reload development server for MCP environments
The `hud dev` command provides hot-reload development for MCP environments, automatically restarting your server when source files change.
## Synopsis
```bash
hud dev [DIRECTORY] [OPTIONS]
```
## Description
`hud dev` creates a development proxy server that runs your MCP environment in Docker with hot-reload capabilities. It automatically detects or builds your Docker image and exposes it via HTTP or stdio for development.
**Key Features:**
* **Auto-detection**: Automatically finds or generates Docker image names
* **Hot-reload**: Mounts local `src/` directory for live code changes
* **Interactive testing**: Test tools directly from the command line with arrow keys
* **HTTP/Stdio protocols**: Choose transport method
* **Inspector support**: Launch MCP Inspector for debugging
* **Container management**: Handles Docker lifecycle automatically
## Arguments
| Argument | Description | Default |
| ----------- | -------------------------------- | ------------- |
| `DIRECTORY` | Environment directory (optional) | `.` (current) |
## Options
| Option | Description | Default |
| ------------------- | -------------------------------------------- | ------------- |
| `--image`, `-i` | Docker image name (overrides auto-detection) | Auto-detected |
| `--build`, `-b` | Build image before starting | `false` |
| `--no-cache` | Force rebuild without cache | `false` |
| `--transport`, `-t` | Transport protocol: `http` or `stdio` | `http` |
| `--port`, `-p` | HTTP server port (ignored for stdio) | `8765` |
| `--interactive` | Launch interactive tool testing interface | `false` |
| `--no-reload` | Disable hot-reload functionality | `false` |
| `--verbose`, `-v` | Show detailed server logs | `false` |
| `--inspector` | Launch MCP Inspector (HTTP mode only) | `false` |
| `--no-logs` | Disable streaming Docker logs | `false` |
## Examples
### Auto-Detection Mode (Recommended)
Start development server without specifying directory (auto-detects from current location):
```bash
hud dev
```
### Build and Start
Build Docker image first, then start development server:
```bash
hud dev --build
```
### Specific Directory
Watch a specific environment directory:
```bash
hud dev environments/my-browser-env --build
```
### Custom Image
Use a specific Docker image name:
```bash
hud dev . --image my-custom-env:dev --build
```
### HTTP Mode with Inspector
Start in HTTP mode and launch MCP Inspector:
```bash
hud dev . --build --transport http --inspector
```
### Stdio Mode
Use stdio transport (for multiple simultaneous connections):
```bash
hud dev . --build --transport stdio
```
### Clean Rebuild
Force a complete rebuild without Docker cache:
```bash
hud dev . --build --no-cache
```
### Verbose Logging
Show detailed logs from the container:
```bash
hud dev . --build --verbose
```
### Interactive Testing
Launch with interactive tool testing interface:
```bash
hud dev . --build --interactive
```
Interactive mode disables Docker logs and hot-reload to provide a clean interface for testing tools with arrow key navigation.
## How It Works
1. **Image Resolution**:
* Checks `pyproject.toml` for `[tool.hud.image]` setting
* Auto-generates name like `hud-{directory-name}:dev` if not found
* Uses `--image` override if provided
2. **Docker Setup**:
* Builds image with `--build` if requested
* Mounts local `src/` directory to `/app/src` in container
* Injects hot-reload supervisor (unless `--no-reload`)
3. **Proxy Server**:
* Creates FastMCP proxy that forwards to Docker containers
* Each client connection gets its own container instance
* Automatically handles container lifecycle
4. **Development Flow**:
* Edit files in `src/` directory
* Changes are immediately reflected in running containers
* No rebuild required for code changes
* **Only the MCP server restarts**, not your environment state
### Process Separation Architecture
For stateful environments, `hud dev` supports a critical design pattern: separating the MCP server from the environment process. This separation enables hot-reload without losing state.
**Why Separation Matters:**
* MCP server can restart instantly for code changes
* Environment state (browsers, databases, games) persists
* Development is faster without constant state resets
**Example Architecture:**
```
┌─────────────────┐ ┌──────────────────────┐
│ MCP Server │────▶│ Environment Process │
│ (Restartable) │ │ (Persistent) │
│ │ │ │
│ - Lightweight │ │ - Maintains state │
│ - Fast startup │ │ - Long-lived │
│ - MCP protocol │ │ - Resource heavy │
└─────────────────┘ └──────────────────────┘
▲ │
│ │
└─── Unix Socket ─────────┘
TCP, gRPC, etc.
```
**Connection Methods:**
* Unix sockets (recommended for local dev)
* TCP/HTTP endpoints
* gRPC services
* Shared memory/IPC
See the browser environment for a complete implementation example.
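A minimal sketch of this separation, reusing the `hud.server.context` helpers shown in the `hud init` template later in this reference; the socket path and class names are illustrative:
```python
# environment.py - long-lived process that owns the state
import asyncio
from hud.server.context import run_context_server

class BrowserState:
    def __init__(self):
        self.page_count = 0

    def open_page(self):
        self.page_count += 1
        return self.page_count

if __name__ == "__main__":
    # Serves the state object over a local socket and keeps running across MCP restarts
    asyncio.run(run_context_server(BrowserState()))
```
```python
# server.py - restartable MCP server that re-attaches after every hot-reload
from hud.server import MCPServer
from hud.server.context import attach_context

mcp = MCPServer(name="my-env")
ctx = None

@mcp.initialize
async def init(init_ctx):
    global ctx
    ctx = attach_context("/tmp/hud_ctx.sock")  # reconnect to the persistent process

@mcp.tool()
async def open_page() -> str:
    return f"Opened page #{ctx.open_page()}"

if __name__ == "__main__":
    mcp.run()
```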
## File Mounting
The development server mounts your local `src/` directory directly into containers:
```
Local: ./src/ → Container: /app/src/
Local: ./src/my_package/ → Container: /app/src/my_package/
Local: ./src/server.py → Container: /app/src/server.py
```
**Live Changes**: Any edits to files in `src/` are immediately reflected in running containers.
**Hot-Reload Behavior**:
* With process separation: Only the MCP server restarts, environment state persists
* Without separation: Entire environment restarts (state is lost)
* Disable with `--no-reload` if manual control is preferred
For environments with important state (browsers, databases, running applications), implement process separation to preserve state during hot-reload. This dramatically improves development speed.
## Interactive Mode
When launched with `--interactive`, `hud dev` provides a command-line interface for testing tools:
```bash
hud dev . --interactive
```
**Features:**
* **Arrow Key Navigation**: Select tools using up/down arrows
* **Parameter Input**: Enter tool parameters with validation
* **Result Display**: See tool responses in real-time
* **Clean Interface**: Disables logs for focused testing
* **Exit Gracefully**: Select "Quit" or press Ctrl+C
**Workflow:**
1. Environment starts and connects
2. Tools list appears with arrow navigation
3. Select a tool and press Enter
4. Input parameters if required
5. See results immediately
6. Continue testing or quit
Interactive mode is perfect for rapid tool development - make changes to your code and test immediately without leaving the terminal.
## Transport Modes
### HTTP Transport (Default)
```bash
hud dev . --transport http --port 8080
```
**Benefits:**
* Web browser access
* MCP Inspector integration
* Multiple simultaneous connections
* RESTful debugging
**URL:** `http://localhost:8080/mcp`
### Stdio Transport
```bash
hud dev . --transport stdio
```
**Benefits:**
* Direct MCP protocol
* Lower latency
* Cursor/Claude integration ready
* Single connection focused
**Usage:** Connect via stdio to the running container
## Integration Examples
### Cursor Integration
1. Start development server:
```bash
hud dev . --build --transport http --port 8765
```
2. Add to Cursor's MCP config:
```json
{
  "mcpServers": {
    "my-dev-env": {
      "url": "http://localhost:8765/mcp"
    }
  }
}
```
3. Edit files - changes apply immediately!
### Testing During Development
Run analysis in another terminal:
```bash
# Test tools are working
hud analyze hud-my-env:dev
# Full debug check
hud debug hud-my-env:dev
```
### MCP Inspector
Launch inspector for visual debugging:
```bash
hud dev . --build --inspector
```
Opens browser with MCP Inspector interface showing:
* Available tools and schemas
* Real-time tool calls
* Protocol messages
* Error debugging
## Troubleshooting
### Port Already in Use
```bash
# Use different port
hud dev . --port 9000
# Or find and kill existing process
lsof -ti :8765 | xargs kill
```
### Build Failures
```bash
# Clean rebuild
hud dev . --build --no-cache
# Check Dockerfile
docker build -t my-env .
```
### File Changes Not Detected
```bash
# Check file permissions
ls -la src/
# Force restart (Ctrl+C then restart)
hud dev . --build --verbose
```
### Container Startup Issues
```bash
# Debug with full logs
hud dev . --build --verbose
# Test image manually
docker run -it my-env:dev
```
## Performance Tips
1. **Exclude Large Files**: Use `.dockerignore` to exclude large files from build context
2. **Layer Caching**: Structure Dockerfile for optimal layer caching
3. **Minimal Rebuilds**: Only use `--build` when dependencies change
4. **Port Selection**: Use uncommon ports to avoid conflicts
## Development Workflow
Recommended workflow for environment development:
```bash
# 1. Navigate to your environment directory
cd environments/my-env
# 2. Start development server with interactive testing
hud dev --build --interactive
# 3. Test tools interactively with arrow keys
# Make code changes and test again immediately
# 4. Alternative: Run dev server and test separately
# Terminal 1: Keep hud dev running
hud dev --build
# Terminal 2: Test your changes
hud analyze hud-my-env:dev
# 5. Production build
hud build # Creates hud.lock.yaml
# 6. Share your environment
hud push
```
Using `hud dev --build` also generates or updates `hud.lock.yaml` with environment metadata, similar to `hud build`.
## See Also
* [`hud init`](/reference/cli/init) - Create new environments
* [`hud build`](/reference/cli/build) - Build production images
* [`hud push`](/reference/cli/push) - Share to registry
* [`hud analyze`](/reference/cli/analyze) - Test tools after changes
* [`hud debug`](/reference/cli/debug) - Full environment validation
* [`hud run`](/reference/cli/run) - Run production environments
* [Build Environments](/build-environments) - Environment development guide
# hud init
Source: https://docs.hud.so/reference/cli/init
Create a new HUD environment with minimal boilerplate
The `hud init` command creates a new HUD environment from scratch with all the essential files needed to build, test, and deploy an MCP server.
## Usage
```bash
hud init [NAME] [OPTIONS]
```
## Arguments
`NAME` - Environment name. If not provided, uses the current directory name.
## Options
* `--dir`, `-d` - Target directory where the environment will be created
* `--force`, `-f` - Overwrite existing files if they exist
## What It Creates
The command generates a minimal but complete HUD environment structure:
```
my-env/
├── Dockerfile # Container configuration
├── pyproject.toml # Dependencies and metadata
├── README.md # Basic documentation
└── src/
└── hud_controller/
├── context.py # Persistent state management
└── server.py # MCP server implementation
```
### Generated Files
Minimal MCP server with required tools:
```python
from hud.server import MCPServer
from hud.server.context import attach_context

mcp = MCPServer(name="my-env")
ctx = None

@mcp.initialize
async def init(init_ctx):
    global ctx
    ctx = attach_context("/tmp/hud_ctx.sock")

@mcp.tool()
async def act() -> str:
    """Perform an action."""
    return f"Action #{ctx.act()}"

@mcp.tool()
async def setup() -> str:
    """Required for HUD environments."""
    return "Ready"

@mcp.tool()
async def evaluate() -> dict:
    """Required for HUD environments."""
    return {"count": ctx.get_count()}

if __name__ == "__main__":
    mcp.run()
```
`src/hud_controller/context.py` - Persistent state that survives hot-reloads:
```python
from hud.server.context import run_context_server
import asyncio
class Context:
def __init__(self):
self.count = 0
def act(self):
self.count += 1
return self.count
def get_count(self):
return self.count
if __name__ == "__main__":
asyncio.run(run_context_server(Context()))
```
`Dockerfile` - Production-ready container setup:
```dockerfile
FROM python:3.10-slim
RUN pip install uv
WORKDIR /app
COPY pyproject.toml .
RUN uv pip install --system -e .
COPY src/ src/
CMD ["python", "-m", "hud_controller.server"]
```
`pyproject.toml` - Project configuration with HUD metadata:
```toml
[project]
name = "my-env"
version = "0.1.0"
dependencies = ["hud[sdk]"]
[tool.hud]
image = "hud-my-env:dev"
```
## Examples
### Basic Usage
Create an environment in a new directory:
```bash
hud init my-env
cd my-env
```
### Current Directory
Initialize in the current directory using its name:
```bash
mkdir awesome-tool && cd awesome-tool
hud init
```
### Custom Location
Create in a specific directory:
```bash
hud init my-env --dir ~/projects
cd ~/projects/my-env
```
### Overwrite Existing
Force overwrite existing files:
```bash
hud init my-env --force
```
## Next Steps
After creating your environment:
Run with hot-reload and interactive testing:
```bash
hud dev . --interactive
```
Edit `src/hud_controller/server.py` to add custom tools:
```python
@mcp.tool()
async def my_custom_tool(text: str) -> str:
"""Process some text."""
return text.upper()
```
In interactive mode, use the arrow keys to select and test tools.
When ready, build and share:
```bash
hud build
hud push
```
## Key Concepts
### Required Tools
Every HUD environment must implement:
* `setup()` - Initialize the environment
* `evaluate()` - Return environment state for evaluation
### Context Pattern
The context pattern separates state from server logic:
* State persists across hot-reloads
* Server reconnects to existing context
* Enables rapid development iteration
### Minimal Dependencies
The template uses only essential dependencies:
* `hud[sdk]` - Core HUD functionality
* No unnecessary packages
* Production-ready from the start
## Tips
The generated environment is immediately runnable with `hud dev`. No additional setup required!
Environment names are converted to valid Docker image names automatically (underscores → hyphens).
Use `--force` carefully as it will overwrite existing files without confirmation.
## See Also
* [hud dev](/reference/cli/dev) - Start development with hot-reload
* [hud build](/reference/cli/build) - Build production images
* [Build Environments Guide](/build-environments) - Complete environment development guide
# CLI Overview
Source: https://docs.hud.so/reference/cli/overview
Complete reference for HUD command-line tools
The HUD CLI provides a complete toolkit for creating, developing, and running MCP environments. Commands are organized into two main workflows:
**Directory-based commands** for creating and sharing environments:
* `hud init` - Create new environment
* `hud dev` - Develop with hot-reload
* `hud build` - Build and generate lock file
* `hud push` - Share to registry
**Target-based commands** for consuming environments:
* `hud pull` - Get from registry
* `hud analyze` - Inspect capabilities
* `hud debug` - Test compliance
* `hud run` - Execute locally/remotely
## Installation
```bash
# uv (recommended)
uv tool install hud-python
hud --version
```
```bash
# pip
pip install hud-python
python -m hud.cli --help
```
```bash
# pipx
pipx install hud-python
hud --version
```
## Commands
### Building Workflow
| Command | Input | Description | Example |
| ----------- | --------- | ----------------------- | ------------------------- |
| `hud init` | Directory | Create new environment | `hud init my-env` |
| `hud dev` | Directory | Hot-reload development | `hud dev . --interactive` |
| `hud build` | Directory | Build image & lock file | `hud build . --tag v1.0` |
| `hud push` | Directory | Share to registry | `hud push . --tag prod` |
### Running Workflow
| Command | Input | Description | Example |
| ------------- | ------ | ----------------------------- | ------------------------- |
| `hud pull` | Target | Get environment from registry | `hud pull org/env:latest` |
| `hud analyze` | Target | Inspect tools & capabilities | `hud analyze org/env` |
| `hud debug` | Target | Test MCP compliance | `hud debug my-env:latest` |
| `hud run` | Target | Execute server | `hud run my-env --local` |
### Utilities
| Command | Description | Example |
| ----------------- | --------------------------- | ----------------- |
| `hud quickstart` | Clone quickstart repository | `hud quickstart` |
| `hud cursor-list` | List Cursor MCP servers | `hud cursor-list` |
| `hud version` | Show version info | `hud version` |
## Complete Workflows
### Building an Environment
Create a new HUD environment with minimal boilerplate:
```bash
hud init my-env && cd my-env
```
Creates `Dockerfile`, `pyproject.toml`, and minimal MCP server with context.
Run with hot-reload and interactive testing:
```bash
hud dev
```
Your changes reload automatically. Test tools interactively with arrow keys.
Create production image and lock file:
```bash
hud build
```
Generates `hud.lock.yaml` with metadata and labels image for reproducibility.
Share to Docker and HUD registries:
```bash
hud push
```
Requires `HUD_API_KEY`. Auto-detects registry from Docker login.
### Running an Environment
Get environment with metadata preview:
```bash
hud pull hudpython/text_init:latest
```
Shows tools, env vars, and build info before downloading.
Quick inspection without running:
```bash
hud analyze hudpython/text_init # Fast (from metadata)
hud analyze hudpython/text_init --live # Full (runs container)
```
Test MCP protocol compliance:
```bash
hud debug hudpython/text_init:latest
```
Validates through 5 phases of initialization.
Execute in production mode:
```bash
hud run hudpython/text_init:latest # Remote (default)
hud run hudpython/text_init:latest --local # Local Docker
```
## Common Usage
### Docker Images
```bash
# Basic
hud debug my-image:latest
# With options
hud debug my-image:latest -e DEBUG=true -p 8080:8080
# From registry
hud analyze ghcr.io/org/image:v1.0.0
```
### Local Commands
```bash
# Python script
hud debug --command "python server.py"
# Node.js server
hud debug --command "node mcp-server.js"
```
### Cursor Integration
```bash
# Debug Cursor server
hud debug --cursor my-dev-server
# List all servers
hud cursor-list
```
## Output Formats
### Interactive (Default)
```bash
hud analyze my-env
🔍 Analyzing MCP environment: my-env
✓ Connected successfully
✓ Found 12 tools
Available Tools:
• click - Click at coordinates
• type - Type text
...
```
### JSON
```bash
hud analyze my-env --format json
{
"tools": [{
"name": "click",
"description": "Click at coordinates",
"parameters": {...}
}]
}
```
### Markdown
```bash
hud analyze my-env --format markdown > docs/tools.md
```
## CI/CD Example
```bash
#!/bin/bash
set -e
# Test environment
hud debug "$IMAGE_NAME"
# Verify tools
TOOLS=$(hud analyze "$IMAGE_NAME" --format json | jq '.tools | length')
if [ "$TOOLS" -lt 3 ]; then
echo "Not enough tools!"
exit 1
fi
```
## Python Scripting
```python
import subprocess
import json
def get_tools(image):
result = subprocess.run(
["hud", "analyze", image, "--format", "json"],
capture_output=True,
text=True,
check=True
)
return json.loads(result.stdout)["tools"]
# Use
tools = get_tools("my-env:latest")
for tool in tools:
print(f"- {tool['name']}: {tool['description']}")
```
## Exit Codes
| Code | Meaning | Description |
| ---- | ---------------- | ------------------- |
| 0 | Success | Command completed |
| 1 | General Error | Command failed |
| 2 | Usage Error | Invalid arguments |
| 3 | Connection Error | Failed to connect |
| 4 | Timeout | Operation timed out |
| 5 | Protocol Error | MCP violation |
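These exit codes make the CLI straightforward to script against. A minimal sketch, assuming the `hud` binary is on `PATH` (the image name is illustrative):
```python
import subprocess

# Run a compliance check and branch on the documented exit codes
result = subprocess.run(["hud", "debug", "my-env:latest"])

if result.returncode == 0:
    print("Environment passed all checks")
elif result.returncode in (3, 4):
    print("Connection issue or timeout - consider retrying")
else:
    print(f"Debug failed with exit code {result.returncode}")
```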
## Environment Variables
```bash
# Debug output
export HUD_CLI_DEBUG=true
# Custom timeout
export HUD_CLI_TIMEOUT=120
# Provider keys
export ANCHOR_API_KEY=...
```
## Next Steps
### Building Commands
* [`hud init`](/reference/cli/init) - Create new environments from scratch
* [`hud dev`](/reference/cli/dev) - Develop with hot-reload and interactive testing
* [`hud build`](/reference/cli/build) - Build images and generate lock files
* [`hud push`](/reference/cli/push) - Share environments to registry
### Running Commands
* [`hud pull`](/reference/cli/pull) - Get environments with metadata preview
* [`hud analyze`](/reference/cli/analyze) - Inspect tools and capabilities
* [`hud debug`](/reference/cli/debug) - Test MCP protocol compliance
* [`hud run`](/reference/cli/run) - Execute servers locally or remotely
# hud pull
Source: https://docs.hud.so/reference/cli/pull
Get HUD environments from registry with metadata preview
The `hud pull` command fetches HUD environments from registries, showing a preview of tools and capabilities before downloading the Docker image.
## Usage
```bash
hud pull TARGET [OPTIONS]
```
## Arguments
* `TARGET` - Image reference (e.g., `org/env:tag`) or path to a lock file
## Options
* `--lock`, `-l` - Path to save the lock file (if target is an image reference)
* `--yes`, `-y` - Skip confirmation prompt
* `--verify-only` - Only verify metadata without pulling the image
* `--verbose`, `-v` - Show detailed output
## How It Works
1. Retrieves environment information from:
   * HUD registry (if available)
   * Docker manifest (fallback)
   * Local lock file (if provided)
2. Shows environment details:
   * Available tools and descriptions
   * Build information
   * Required environment variables
   * Image size
3. Prompts for confirmation before pulling (unless `--yes`)
4. Downloads the Docker image using `docker pull`
5. Stores metadata locally for offline access
## Examples
### Basic Pull
Pull from HUD registry:
```bash
hud pull hudpython/text-analyzer:latest
```
Output:
```
🚀 HUD Environment Pull
Fetching metadata for: hudpython/text-analyzer:latest
📦 Environment Details
┌─────────────┬──────────────────────────┐
│ Image │ hudpython/text-analyzer │
│ Source │ HUD Registry │
│ Built │ 2024-01-15T10:30:00Z │
│ HUD Version │ 0.1.0 │
│ Size │ 156 MB │
│ Tools │ 8 │
└─────────────┴──────────────────────────┘
🔧 Available Tools
├── setup - Initialize environment
├── evaluate - Return environment state
├── analyze_text - Analyze text sentiment
├── extract_entities - Extract named entities
└── ... 4 more tools
🔑 Environment Variables
├── Required: OPENAI_API_KEY
└── Optional: DEBUG, LOG_LEVEL
Pull hudpython/text-analyzer:latest? [y/N]: y
```
### Skip Confirmation
Auto-pull for scripts:
```bash
hud pull myorg/myenv:v1.0.0 --yes
```
### Verify Only
Check metadata without downloading:
```bash
hud pull myorg/myenv:latest --verify-only
```
### From Lock File
Pull using a shared lock file:
```bash
# Someone shares their hud.lock.yaml
hud pull ./shared-env.lock.yaml
```
### Custom Lock Location
Save lock file to specific path:
```bash
hud pull myorg/myenv:latest --lock ~/.hud/locks/myenv.yaml
```
## Metadata Sources
The command tries these sources in order:
1. **HUD Registry** (Fastest)
* Full tool descriptions
* Build metadata
* Environment requirements
2. **Docker Registry** (Limited)
* Basic image info
* Size and layers
* No tool details
3. **Lock File** (Complete)
* Full environment snapshot
* Reproducible reference
* Offline capable
## Lock File Storage
Downloaded lock files are saved to:
```
~/.hud/locks/{org}_{name}_{tag}.yaml
```
Examples:
* `hudpython_text-analyzer_latest.yaml`
* `myorg_browser-env_v1.0.0.yaml`
Lock files are cached locally, so repeated `hud analyze` commands are instant.
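If you want to inspect the cache programmatically, a rough sketch (assuming PyYAML is installed and each lock file carries a top-level `image` field, as in the lock file examples under `hud push`):
```python
from pathlib import Path

import yaml  # assumes PyYAML is available

locks_dir = Path.home() / ".hud" / "locks"

# List cached environments and the image each lock file points at
for lock_file in sorted(locks_dir.glob("*.yaml")):
    data = yaml.safe_load(lock_file.read_text())
    print(f"{lock_file.name}: {data.get('image', '<unknown image>')}")
```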
## Next Steps
After pulling, the command shows:
```
✅ Image pulled successfully
✅ Lock file saved to: ~/.hud/locks/hudpython_text-analyzer_latest.yaml
Next Steps:
1. Quick analysis: hud analyze hudpython/text-analyzer
2. Live analysis: hud analyze hudpython/text-analyzer --live
3. Run locally: hud run hudpython/text-analyzer:latest
```
## Registry Access
### Public Environments
No authentication needed:
```bash
# HUD's public environments
hud pull hudpython/code-analyzer:latest
hud pull hudpython/web-browser:latest
```
### Private Registries
Requires Docker login:
```bash
# Login first
docker login private-registry.com
# Then pull
hud pull private-registry.com/team/tool:latest
```
### API Key Benefits
With `HUD_API_KEY` set:
* Access private HUD metadata
* Faster metadata retrieval
* Additional environment details
```bash
export HUD_API_KEY="your-key"
hud pull myorg/private-env:latest
```
## Troubleshooting
### Common Issues
The environment may not have been pushed to the HUD registry:
* Metadata will be limited to Docker manifest info
* Ask the author to run `hud push`
* Use `--verbose` to see fallback behavior
Check Docker authentication:
```bash
# Verify login
docker login
# Test manual pull
docker pull myorg/myenv:latest
```
HUD registry may be slow or unavailable:
```bash
# Skip HUD registry, use Docker only
hud pull myorg/myenv:latest --verbose
```
### Network Issues
Behind a proxy? Configure Docker:
```bash
export HTTP_PROXY=http://proxy:8080
export HTTPS_PROXY=http://proxy:8080
```
## Integration Examples
### CI/CD Pipeline
```yaml
# GitHub Actions
- name: Pull HUD environment
run: |
hud pull myorg/test-env:latest --yes
hud analyze myorg/test-env:latest
```
### Docker Compose
```yaml
services:
mcp-env:
image: myorg/myenv:latest
environment:
- API_KEY=${API_KEY}
```
### Scripted Setup
```bash
#!/bin/bash
# Pull multiple environments
ENVS=("web-browser" "code-analyzer" "data-processor")
for env in "${ENVS[@]}"; do
echo "Pulling $env..."
hud pull "hudpython/$env:latest" --yes
done
```
## Best Practices
1. **Version Pinning**: Use specific tags
```bash
hud pull myorg/tool:v1.2.3 # Good
hud pull myorg/tool:latest # Risky
```
2. **Cache Lock Files**: Share lock files for reproducibility
```bash
# Save to project
hud pull myorg/env:v1 --lock ./env.lock.yaml
git add env.lock.yaml
```
3. **Verify First**: Check before downloading large images
```bash
hud pull myorg/large-env:latest --verify-only
```
4. **Document Requirements**: Note any required env vars
```bash
# After pull, check requirements
hud analyze myorg/env:latest
```
## See Also
* [`hud push`](/reference/cli/push) - Share environments
* [`hud analyze`](/reference/cli/analyze) - Inspect pulled environments
* [`hud run`](/reference/cli/run) - Execute pulled environments
* [`hud debug`](/reference/cli/debug) - Test pulled environments
* [Build Environments](/build-environments/registry) - Registry guide
# hud push
Source: https://docs.hud.so/reference/cli/push
Share HUD environments to Docker and HUD registries
The `hud push` command publishes your HUD environment to both Docker registry and HUD registry, making it available for others to use with `hud pull`.
## Usage
```bash
hud push [DIRECTORY] [OPTIONS]
```
## Arguments
* `DIRECTORY` - Environment directory containing `hud.lock.yaml`
## Options
* `--image`, `-i` - Override registry image name (e.g., `myorg/myenv`)
* `--tag`, `-t` - Override tag (e.g., `v1.0`, `latest`)
* `--sign` - Sign the image with cosign (not yet implemented)
* `--yes`, `-y` - Skip confirmation prompts
* `--verbose`, `-v` - Show detailed output
## Prerequisites
`hud push` requires an API key. Set it via environment variable:
```bash
export HUD_API_KEY="your-api-key"
```
You must be logged in to a Docker registry:
```bash
docker login
# or
docker login ghcr.io
```
## What It Does
1. Reads `hud.lock.yaml` to get the image reference and metadata
2. Creates a registry tag from the local image:
   * Uses `--image` if provided
   * Auto-detects from Docker config
   * Defaults to Docker Hub format
3. Pushes the image to the Docker registry and records the digest
4. Updates `hud.lock.yaml` with:
   * Registry image reference with digest
   * Push timestamp
   * Registry information
5. Sends the lock file to the HUD registry at `{HUD_API_URL}/registry/envs/{org}/{name}:{tag}`
## Registry Detection
The command intelligently detects your registry:
1. **Explicit**: Use `--image` flag
```bash
hud push . --image ghcr.io/myorg/myenv
```
2. **Docker Config**: Reads from `~/.docker/config.json`
```bash
docker login ghcr.io
hud push # Auto-uses ghcr.io
```
3. **Default**: Docker Hub format
```bash
hud push # Creates: yourusername/envname
```
## Examples
### Basic Push
Push to default registry:
```bash
# After hud build
hud push
```
Output:
```
🚀 HUD Environment Push
📋 Local Details
Image: hud-my-env:latest
Lock: hud.lock.yaml (2.3 KB)
Tools: 5 tools available
🎯 Registry Target
Local: hud-my-env:latest
Registry: myusername/my-env:latest
Push to registry? [y/N]: y
✅ Tagged image
✅ Pushed to registry: myusername/my-env@sha256:abc123...
✅ Updated lock file
✅ Metadata uploaded to HUD registry
```
### Custom Registry
Push to GitHub Container Registry:
```bash
hud push . --image ghcr.io/myorg/my-tool --tag v1.0.0
```
### CI/CD Automation
Skip prompts for automation:
```bash
hud push . --yes --tag "prod-$(date +%Y%m%d)"
```
### Override Tag
Push with different tag:
```bash
hud push . --tag experimental
```
## Lock File Updates
After pushing, `hud.lock.yaml` gains registry information:
```yaml
version: "1.0"
image: "myorg/myenv:latest@sha256:abc123..."
build:
generatedAt: "2024-01-15T10:30:00Z"
hudVersion: "0.1.0"
directory: "my-env"
push:
source: "hud-my-env:latest"
pushedAt: "2024-01-15T11:00:00Z"
registry: "ghcr.io"
environment:
# ... environment details
tools:
# ... tool information
```
## Sharing Workflows
### Public Sharing
After pushing, others can use:
```bash
# Pull with metadata preview
hud pull myorg/myenv:latest
# Use in MCP configs
{"hud": {"registry": "myorg/myenv:latest"}}
# Direct Docker usage
docker run myorg/myenv:latest
```
### Private Registries
For private registries, users need:
1. Docker registry access:
```bash
docker login private-registry.com
```
2. HUD API key (if using private HUD registry):
```bash
export HUD_API_KEY="shared-key"
```
## Usage Messages
After successful push, helpful usage examples are shown:
```
📦 Usage:
Test locally:
hud run myorg/myenv:latest
Use in MCP configs:
Claude Desktop:
{"docker": {"image": "myorg/myenv@sha256:...", "command": "auto"}}
Via HUD (recommended):
{"hud": {"registry": "myorg/myenv:latest"}}
Share environment:
Others can now pull with: hud pull myorg/myenv:latest
```
## Troubleshooting
### Authentication Issues
Not logged in to a Docker registry:
```bash
# Docker Hub
docker login
# GitHub Registry
docker login ghcr.io -u USERNAME
# Custom registry
docker login my-registry.com
```
Missing or invalid `HUD_API_KEY`:
```bash
# Set API key
export HUD_API_KEY="your-key-here"
# Or create .env file
echo "HUD_API_KEY=your-key-here" > .env
```
Ensure you have push access to the registry namespace:
* Docker Hub: Must own the namespace
* GitHub: Must have package write permissions
* Private: Check with your admin
### Push Failures
For large images, push may timeout. Use `--verbose` to see progress:
```bash
hud push . --verbose
```
If metadata upload fails, the Docker image is still pushed. You can retry metadata upload or share `hud.lock.yaml` manually.
## Best Practices
1. **Version Tags**: Use semantic versioning
```bash
hud push . --tag v1.0.0
hud push . --tag v1.1.0-beta
```
2. **Namespace Organization**: Group related environments
```bash
hud push . --image myorg/tools/browser
hud push . --image myorg/tools/scraper
```
3. **Documentation**: Update README with pull instructions
````markdown
## Installation
```bash
hud pull myorg/myenv:latest
```
````
4. **CI Integration**: Automate pushes
```yaml
- name: Push HUD environment
env:
HUD_API_KEY: ${{ secrets.HUD_API_KEY }}
run: hud push . --yes --tag ${{ github.ref_name }}
```
## Security
The HUD registry only stores metadata (lock files), not code or images. Actual images are stored in Docker registries with their existing security models.
* **API Keys**: Keep `HUD_API_KEY` secret
* **Registry Auth**: Use Docker's credential helpers
* **Image Signing**: Coming soon with `--sign` flag
* **Private Envs**: Use private Docker registries
## See Also
* [`hud build`](/reference/cli/build) - Build before pushing
* [`hud pull`](/reference/cli/pull) - Get shared environments
* [`hud analyze`](/reference/cli/analyze) - Inspect pushed environments
* [Build Environments](/build-environments/registry) - Registry guide
# hud run
Source: https://docs.hud.so/reference/cli/run
Execute MCP environments locally or remotely for production use
The `hud run` command executes MCP environments either locally via Docker or remotely via the HUD platform. Unlike `hud dev` which is for development, `hud run` is optimized for production execution.
## Usage
```bash
hud run IMAGE [OPTIONS] [-- DOCKER_ARGS...]
```
## Arguments
* `IMAGE` - Docker image reference (e.g., `myorg/env:latest`)
* `DOCKER_ARGS` - Additional Docker arguments after `--` (e.g., `-e KEY=value`)
Always use `--` before Docker arguments to avoid conflicts with HUD options:
* ✅ `hud run image:tag -- -e KEY=value`
* ❌ `hud run image:tag -e KEY=value`
## Options
* `--local` - Run locally with Docker instead of remote execution
* `--remote` - Explicitly run remotely (default behavior)
* `--transport`, `-t` - Transport protocol: `stdio` or `http`
* `--port`, `-p` - Port for HTTP transport (ignored for stdio)
* `--url` - Remote MCP server URL
* `--api-key` - API key for remote server (defaults to `HUD_API_KEY` env var)
* `--run-id` - Run ID for tracking (remote only)
* `--verbose`, `-v` - Show detailed output
## Remote Execution (Default)
By default, `hud run` executes your Docker image on the HUD platform's remote infrastructure. This provides:
* **Scalability**: Run hundreds of concurrent instances
* **Resource Management**: Automatic container lifecycle management
* **Monitoring**: Live telemetry and performance metrics
* **Geographic Distribution**: Global edge deployment
### Remote Examples
```bash
# Basic remote execution
hud run hudpython/text-2048:latest
# With environment variables
hud run myorg/server:v1 -- -e API_KEY=secret -e DEBUG=true
# HTTP transport with custom port
hud run myorg/server:v1 --transport http --port 9000
# Custom remote URL and API key
hud run my-image:latest --url https://custom.mcp.server --api-key my-key
# With run tracking
hud run production-env:v1.0 --run-id "deploy-$(date +%Y%m%d)"
```
### Authentication
Remote execution requires authentication via API key:
```bash
# Set environment variable (recommended)
export HUD_API_KEY=your-api-key-here
hud run my-image:latest
# Or pass directly (less secure)
hud run my-image:latest --api-key your-api-key-here
```
### Run Tracking
Use `--run-id` to track specific execution instances:
```bash
hud run my-image:latest --run-id "experiment-001"
```
## Local Execution
Use `--local` to run the image locally with Docker:
```bash
# Basic local execution
hud run hudpython/text-2048:latest --local
# With Docker arguments
hud run myorg/server:v1 --local -- -e API_KEY=secret -v /data:/app/data
# HTTP transport locally
hud run myorg/server:v1 --local --transport http --port 8080
# With memory limits and volumes
hud run my-env:latest --local -- --memory 2g -v $(pwd)/data:/app/data
```
Local execution is useful for:
* Development and testing
* Environments without internet access
* Custom Docker configurations
* Debugging before remote deployment
## Transport Modes
### Stdio Transport (Default)
```bash
hud run my-image:latest --transport stdio
```
**Characteristics:**
* Direct JSON-RPC over stdin/stdout
* Single client connection
* Lower latency
* MCP protocol native
### HTTP Transport
```bash
hud run my-image:latest --transport http --port 8765
```
**Characteristics:**
* HTTP proxy at specified port
* Multiple concurrent connections
* Web-compatible
* Inspector support
## Docker Arguments
Pass Docker arguments after `--` separator:
```bash
# Environment variables
hud run my-image:latest -- -e DEBUG=true -e API_KEY=secret
# Volume mounts
hud run my-image:latest -- -v /host/data:/container/data
# Network settings
hud run my-image:latest -- --network host
# Memory limits
hud run my-image:latest -- --memory 2g
# Combined arguments
hud run my-image:latest -- -e API_KEY=secret -v /data:/app/data --memory 1g
```
The `--` separator is required to distinguish HUD options from Docker arguments.
## Environment Variables
| Variable | Description | Default |
| ------------- | --------------------------------- | --------------------------- |
| `HUD_MCP_URL` | Remote MCP server URL | `https://mcp.hud.so/v3/mcp` |
| `HUD_API_KEY` | API key for remote authentication | *required for remote* |
## Comparison with Other Commands
| Command | Purpose | Hot-reload | Target Use Case |
| ----------- | ----------------------------- | ---------- | ----------------------------- |
| `hud dev` | Development with auto-restart | ✅ Yes | Local development iteration |
| `hud run` | Production execution | ❌ No | Production workloads, testing |
| `hud debug` | Environment validation | ❌ No | Debugging and validation |
## Examples
### Basic Usage
```bash
# Run remotely (default)
hud run hudpython/text-2048:latest
# Run locally for testing
hud run hudpython/text-2048:latest --local
# Verbose output for debugging
hud run my-image:latest --local --verbose
```
### Production Deployment
```bash
# Remote execution with tracking
export HUD_API_KEY=your-production-key
hud run production-env:v1.0 --run-id "prod-deploy-$(date +%s)"
# With environment configuration
hud run production-env:v1.0 --run-id "prod-$(git rev-parse --short HEAD)" -- \
-e DATABASE_URL=postgres://... \
-e REDIS_URL=redis://... \
-e LOG_LEVEL=info
```
### Development Testing
```bash
# Test locally before remote deployment
hud run my-env:dev --local -- -e DEBUG=true
# Test with HTTP transport
hud run my-env:dev --local --transport http --port 9000
# Test with custom data volume
hud run my-env:dev --local -- -v $(pwd)/test-data:/app/data
```
### CI/CD Integration
```yaml
# GitHub Actions example
- name: Run HUD environment
env:
HUD_API_KEY: ${{ secrets.HUD_API_KEY }}
run: |
# Pull latest
hud pull myorg/env:latest --yes
# Run tests
hud run myorg/env:latest --run-id "ci-${{ github.sha }}"
```
## Error Handling
`hud run` provides clear error messages for common issues:
* **Missing image**: Suggests available images or build commands
* **Authentication failure**: Guides through API key setup
* **Port conflicts**: Suggests alternative ports
* **Docker errors**: Shows Docker-specific troubleshooting
## See Also
* [`hud pull`](/reference/cli/pull) - Get environments before running
* [`hud analyze`](/reference/cli/analyze) - Inspect capabilities first
* [`hud debug`](/reference/cli/debug) - Test environment compliance
* [`hud dev`](/reference/cli/dev) - Development with hot-reload
* [Build Environments](/build-environments) - Environment development guide
# Environments
Source: https://docs.hud.so/reference/environments
SDK reference for building MCP environments
The HUD SDK provides `MCPServer` for building MCP-compatible environments that work with any MCP client.
## MCPServer
```python
from hud.server import MCPServer
```
Enhanced FastMCP server with Docker-friendly features for building HUD environments.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------------ | ----- | ------------------------------- | -------- |
| `name` | `str` | Server name for MCP handshake | Required |
| `instructions` | `str` | Server instructions/description | `None` |
| `**fastmcp_kwargs` | `Any` | Additional FastMCP parameters | - |
**Key Features:**
1. **SIGTERM handling** - Graceful shutdown in containers via custom runner
2. **Initialize decorator** - Async setup during MCP initialize request
3. **Shutdown decorator** - Cleanup after lifespan context ends
4. **Enhanced add\_tool()** - Automatically handles BaseTool instances
5. **FastMCP inheritance** - All FastMCP methods available (`mount`, `resource`, `tool`)
### Decorators
#### @initialize
Run async setup during MCP initialize request:
```python
mcp = MCPServer(name="my-env")
@mcp.initialize
async def setup_environment(ctx):
"""
Initialize environment resources.
Args:
ctx: RequestContext with:
- ctx.meta: Client metadata dict
- ctx.session: MCP ServerSession
"""
# Access metadata from agent (if provided)
if ctx.meta:
progress_token = ctx.meta.get("progressToken")
display_width = ctx.meta.get("display_width", 1920)
display_height = ctx.meta.get("display_height", 1080)
# Send progress notifications
if progress_token:
await ctx.session.send_progress_notification(
progress_token=progress_token,
progress=50,
total=100,
message="Initializing environment..."
)
```
#### @shutdown
Run cleanup after server lifespan ends:
```python
@mcp.shutdown
async def cleanup():
"""Clean up resources on shutdown."""
if browser_provider:
browser_provider.close()
logger.info("Cleanup complete")
```
### Tool Registration
Three ways to register tools:
```python
# 1. Decorator for simple functions
@mcp.tool()
async def my_tool(param: str) -> dict:
return {"result": param}
# 2. Add BaseTool instances
from hud.tools import BashTool
bash = BashTool()
mcp.add_tool(bash) # Automatically uses bash.mcp internally
# 3. Add non-BaseTool instances directly
from custom import PlaywrightTool
playwright = PlaywrightTool()
mcp.add_tool(playwright) # Added as-is
```
### Hub Pattern (mount)
Use BaseHub for organized tool namespaces:
```python
from hud.tools import BaseHub
from mcp.types import TextContent
# Create hub
setup_hub = BaseHub("setup")
# Add internal tools (hidden from agents)
@setup_hub.tool("board")
async def setup_board(size: int = 4):
game = setup_hub.env
game.reset(size=size)
return [TextContent(text=f"{size}x{size} board initialized")]
# Mount hub on server
mcp.mount(setup_hub)
# Agents call via dispatcher: setup(name="board", arguments={"size": 4})
```
### Resources
Expose metadata via MCP resources:
```python
@mcp.resource("telemetry://live")
async def get_telemetry():
"""Expose live telemetry data."""
return {
"provider": os.getenv("BROWSER_PROVIDER"),
"status": "running" if browser_provider else "stopped",
"live_url": browser_provider.get_live_view_url() if browser_provider else None,
"timestamp": datetime.now().isoformat()
}
```
### Running the Server
```python
if __name__ == "__main__":
# Run with SIGTERM handling (stdio by default)
mcp.run()
# Or specify transport
mcp.run(transport="sse", port=8080)
```
## Real Environment Examples
### Minimal Environment
```python
# src/hud_controller/server.py
from hud.server import MCPServer
from mcp.types import TextContent
mcp = MCPServer(name="counter-env")
counter = {"value": 0}
@mcp.tool()
async def setup(start_value: int = 0):
"""Initialize counter."""
counter["value"] = start_value
return {"status": "ready", "counter": counter["value"]}
@mcp.tool()
async def increment():
"""Increment counter."""
counter["value"] += 1
return [TextContent(text=f"Counter: {counter['value']}", type="text")]
@mcp.tool()
async def evaluate(target: int):
"""Check if target reached."""
from hud.tools.types import EvaluationResult
return EvaluationResult(
reward=1.0 if counter["value"] >= target else 0.0,
done=counter["value"] >= target
)
if __name__ == "__main__":
mcp.run()
```
### text\_2048 Environment
From `environments/text_2048/src/hud_controller/server.py`:
```python
from hud.server import MCPServer
from .game import Game2048
from .tools import MoveTool
from .setup import setup as setup_hub
from .evaluate import evaluate as evaluate_hub
mcp = MCPServer(name="text-2048")
game = None
@mcp.initialize
async def initialize_environment(ctx):
global game
# Progress notifications
progress_token = getattr(ctx.meta, "progressToken", None) if ctx.meta else None
async def send_progress(progress: int, message: str):
if progress_token:
await ctx.session.send_progress_notification(
progress_token=progress_token,
progress=progress,
total=100,
message=message
)
await send_progress(0, "Starting 2048 game environment...")
# Create game
game = Game2048()
game.reset()
await send_progress(50, "Setting up game board...")
# Set game on hubs
setup_hub.env = game
evaluate_hub.env = game
# Mount hubs
mcp.mount(setup_hub)
mcp.mount(evaluate_hub)
await send_progress(70, "Configuring tools...")
# Add move tool
mcp.add_tool(MoveTool(env=game))
await send_progress(100, "2048 environment ready")
```
### remote\_browser Environment
From `environments/remote_browser/src/hud_controller/server.py`:
```python
from hud.server import MCPServer
from hud.tools.computer import HudComputerTool, AnthropicComputerTool, OpenAIComputerTool
from .tools import PlaywrightToolWithMemory, BrowserExecutor
from .setup import setup as setup_hub
from .evaluate import evaluate as evaluate_hub
from .providers import get_provider
mcp = MCPServer(
name="HUD Remote Browser Environment",
instructions="""Remote browser automation environment..."""
)
# Global state
browser_provider = None
playwright_tool = None
@mcp.resource("telemetry://live")
async def get_telemetry_resource():
"""MCP resource with live browser status."""
return {
"provider": os.getenv("BROWSER_PROVIDER", "unknown"),
"status": "running" if browser_provider else "stopped",
"live_url": browser_provider.get_live_view_url() if browser_provider else None,
"cdp_url": browser_provider.cdp_url if browser_provider else None
}
@mcp.initialize
async def initialize_environment(ctx):
global browser_provider, playwright_tool
# Get metadata
metadata = ctx.meta
progress_token = metadata.get("progressToken", None)
# Initialize provider
provider_name = os.getenv("BROWSER_PROVIDER")
provider_class = get_provider(provider_name)
browser_provider = provider_class(config)
# Launch browser
cdp_url = await browser_provider.launch()
# Create playwright tool
playwright_tool = PlaywrightToolWithMemory(cdp_url=cdp_url)
await playwright_tool._ensure_browser()
# Add playwright tool (not a BaseTool, added directly)
mcp.add_tool(playwright_tool)
# Create computer tools
executor = BrowserExecutor(playwright_tool)
tool_kwargs = {"executor": executor}
# Add display dimensions from metadata
if metadata:
width = metadata.get("display_width")
height = metadata.get("display_height")
if width and height:
tool_kwargs["width"] = width
tool_kwargs["height"] = height
# Add computer tools (all are BaseTool subclasses)
mcp.add_tool(HudComputerTool(**tool_kwargs))
mcp.add_tool(AnthropicComputerTool(**tool_kwargs))
mcp.add_tool(OpenAIComputerTool(**tool_kwargs))
# Mount hubs
setup_hub.env = playwright_tool
evaluate_hub.env = playwright_tool
mcp.mount(setup_hub)
mcp.mount(evaluate_hub)
@mcp.shutdown
async def shutdown_environment():
"""Cleanup browser resources."""
global browser_provider
if browser_provider:
browser_provider.close()
browser_provider = None
```
## Standard Structure
### Directory Layout
```
my-environment/
├── Dockerfile
├── pyproject.toml
├── README.md
└── src/
└── hud_controller/ # Standard package name
├── __init__.py
├── server.py # Entry point with MCPServer
├── context.py # Environment state (optional)
├── setup/ # Setup hub (optional)
│ ├── __init__.py # Creates BaseHub("setup")
│ └── *.py # Setup functions
├── evaluate/ # Evaluate hub (optional)
│ ├── __init__.py # Creates BaseHub("evaluate")
│ └── *.py # Evaluator functions
└── tools/ # Custom tools (optional)
└── *.py # Tool implementations
```
### Dockerfile
```dockerfile
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY pyproject.toml ./
RUN pip install -e .
# Copy source
COPY src/ ./src/
# Critical: stdio transport for MCP
CMD ["python", "-m", "hud_controller.server"]
```
### Hub Module Pattern
Example from text\_2048:
```python
# src/hud_controller/setup/__init__.py
from hud.tools.base import BaseHub
setup = BaseHub("setup")
# Import all setup functions to register them
from . import board
__all__ = ["setup"]
# src/hud_controller/setup/board.py
from . import setup
@setup.tool("board")
async def setup_board(board_size: int = 4):
"""Initialize game board."""
game = setup.env # Access environment from hub
game.reset(size=board_size)
return [TextContent(text=f"{board_size}x{board_size} game initialized")]
```
## Key Concepts
### Environment State
Three patterns for managing state:
1. **Global variables** (simple environments):
```python
game = None
@mcp.initialize
async def initialize_environment(ctx):
global game
game = Game2048()
```
2. **Context class** (complex environments):
```python
class EnvironmentContext:
def __init__(self):
self.browser = None
self.page = None
env = EnvironmentContext()
```
3. **Hub env attribute** (for tool access):
```python
setup_hub.env = game # Tools access via hub.env
```
### Tool Lifecycle
1. **Setup tools** - Hidden from agents, prepare environment state
2. **Interaction tools** - Available to agents for control
3. **Evaluate tools** - Hidden from agents, score performance
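A minimal sketch showing all three categories in one server, following the hub pattern above (the counter state and tool names are illustrative):
```python
from hud.server import MCPServer
from hud.tools import BaseHub
from hud.tools.types import EvaluationResult
from mcp.types import TextContent

mcp = MCPServer(name="lifecycle-demo")
setup_hub = BaseHub("setup")
evaluate_hub = BaseHub("evaluate")
counter = {"value": 0}

# 1. Setup tool - hidden behind the setup hub, prepares environment state
@setup_hub.tool("reset")
async def reset(start: int = 0):
    counter["value"] = start
    return [TextContent(text=f"Counter reset to {start}", type="text")]

# 2. Interaction tool - visible to agents for control
@mcp.tool()
async def increment() -> list[TextContent]:
    counter["value"] += 1
    return [TextContent(text=f"Counter: {counter['value']}", type="text")]

# 3. Evaluate tool - hidden behind the evaluate hub, scores performance
@evaluate_hub.tool("target_reached")
async def target_reached(target: int) -> EvaluationResult:
    reached = counter["value"] >= target
    return EvaluationResult(reward=1.0 if reached else 0.0, done=reached)

mcp.mount(setup_hub)
mcp.mount(evaluate_hub)
```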
### Progress Notifications
Send [progress updates](https://modelcontextprotocol.io/specification/basic/utilities/progress) during long-running operations:
```python
async def send_progress(progress: int, message: str):
if progress_token:
await ctx.session.send_progress_notification(
progress_token=progress_token,
progress=progress,
total=100,
message=message
)
```
Progress notifications follow the [MCP progress specification](https://modelcontextprotocol.io/specification/basic/utilities/progress#progress-flow). The `progressToken` comes from the client's request [metadata](https://modelcontextprotocol.io/specification/basic/index#_meta).
### Metadata Access
Agent metadata flows through initialization:
```python
@mcp.initialize
async def initialize_environment(ctx):
# From agent's metadata class variable
width = ctx.meta.get("display_width", 1920) if ctx.meta else 1920
height = ctx.meta.get("display_height", 1080) if ctx.meta else 1080
```
## Testing
```bash
# CLI testing
hud debug my-env:latest
hud analyze my-env:latest
```
```python
# Python testing
async def test():
    from hud.clients import MCPClient
    client = MCPClient({
        "env": {"command": "docker", "args": ["run", "-i", "my-env"]}
    })
    async with client:
        tools = await client.list_tools()
        result = await client.call_tool("setup", {"value": 0})
```
## See Also
* [Build Environments](/build-environments) - Conceptual guide
* [Tools](/reference/tools) - Tool implementation reference
* [Advanced Design](/build-environments/registry) - Hub pattern details
# Tasks
Source: https://docs.hud.so/reference/tasks
SDK reference for task configuration and dataset utilities
The HUD SDK provides the `Task` class for defining agent objectives and dataset utilities for managing task collections.
## Task Class
```python
from hud.datasets import Task
```
Pydantic model that defines an agent's objective, setup, and evaluation criteria.
**Fields:**
| Field | Type | Description | Default |
| --------------- | ------------------------------------------ | ------------------------------------ | -------- |
| `id` | `str \| None` | Unique identifier (UUID recommended) | `None` |
| `prompt` | `str` | Task instruction for the agent | Required |
| `mcp_config` | `dict[str, Any]` | MCP server configuration | Required |
| `setup_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to prepare environment | `None` |
| `evaluate_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to score performance | `None` |
| `system_prompt` | `str \| None` | Additional system prompt | `None` |
| `metadata` | `dict[str, Any]` | Extra task metadata | `{}` |
### Environment Variable Substitution
The `mcp_config` field automatically resolves environment variables using `${VAR_NAME}` syntax:
```python
task = Task(
prompt="Navigate to the dashboard",
mcp_config={
"browser": {
"url": "${HUD_MCP_URL:https://mcp.hud.so/v3/mcp}",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-browser:latest"
}
}
}
)
```
**Substitution happens automatically when Task is created from a dict** - environment variables are resolved using Python's `Template.substitute()` with a defaultdict that returns empty strings for missing variables.
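For intuition, the resolution behaves roughly like the standalone sketch below (not the SDK's actual code, and it ignores the `${VAR:default}` default syntax shown above):
```python
import os
from collections import defaultdict
from string import Template

config_template = '{"Authorization": "Bearer ${HUD_API_KEY}"}'

# Missing variables resolve to empty strings instead of raising KeyError
env = defaultdict(str, os.environ)
print(Template(config_template).substitute(env))
```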
### Field Validators
Task automatically:
1. **Parses JSON strings** - `mcp_config` and `metadata` can be JSON strings
2. **Converts dicts to MCPToolCall** - `setup_tool` and `evaluate_tool` dicts are converted
3. **Resolves environment variables** - Only when created from dict (preserves templates in model\_dump())
```python
# All of these work:
task = Task(
prompt="Test",
mcp_config='{"server": {"url": "..."}}', # JSON string → dict
setup_tool='{"name": "setup"}' # JSON string → MCPToolCall
)
task = Task(
prompt="Test",
mcp_config={...},
setup_tool={"name": "setup", "arguments": {...}} # dict → MCPToolCall
)
task = Task(
prompt="Test",
mcp_config={...},
evaluate_tool=[ # List of tool calls
{"name": "check_text", "arguments": {"text": "Success"}},
{"name": "check_url", "arguments": {"url": "example.com"}}
]
)
```
## Dataset Functions
### run\_dataset
```python
from hud.datasets import run_dataset
results = await run_dataset(
name="My Evaluation",
dataset="hud-evals/SheetBench-50", # or Dataset object or list[dict]
agent_class=ClaudeAgent,
agent_config={"model": "claude-3-5-sonnet"},
max_concurrent=50,
max_steps=40
)
```
Run all tasks in a dataset with automatic parallelization and job tracking.
**Parameters:**
| Parameter | Type | Description | Default |
| ---------------------- | ------------------------------ | -------------------------------------------- | --------- |
| `name` | `str` | Job name for tracking | Required |
| `dataset` | `str \| Dataset \| list[dict]` | HF dataset ID, Dataset object, or task dicts | Required |
| `agent_class` | `type[MCPAgent]` | Agent class to instantiate | Required |
| `agent_config` | `dict[str, Any] \| None` | Constructor kwargs for agent | `None` |
| `max_concurrent` | `int` | Maximum parallel tasks | `50` |
| `metadata` | `dict[str, Any] \| None` | Job metadata | `None` |
| `max_steps` | `int` | Max steps per task | `40` |
| `split` | `str` | Dataset split when loading by ID | `"train"` |
| `auto_respond` | `bool` | Use ResponseAgent for continuations | `False` |
| `custom_system_prompt` | `str \| None` | Override dataset system prompt | `None` |
**Returns:** `list[Trace]` - Results in dataset order
**Features:**
* Automatic job tracking with `hud.job()`
* System prompt loading from dataset's `system_prompt.txt`
* Parallel execution with semaphore control
* Progress tracking with individual traces
* Graceful error handling (continues on failures)
**Example:**
```python
# From HuggingFace dataset ID
results = await run_dataset(
"SheetBench Evaluation",
"hud-evals/SheetBench-50",
ClaudeAgent,
{"model": "claude-3-5-sonnet-20241022"}
)
# From Dataset object
from datasets import load_dataset
dataset = load_dataset("hud-evals/OSWorld-Verified", split="train")
results = await run_dataset("OSWorld Test", dataset, OperatorAgent)
# From list of dicts
tasks = [
{
"prompt": "Click submit",
"mcp_config": {...},
"setup_tool": {"name": "setup"}
}
]
results = await run_dataset("Custom", tasks, ClaudeAgent)
```
### fetch\_system\_prompt\_from\_dataset
```python
from hud.datasets import fetch_system_prompt_from_dataset
system_prompt = await fetch_system_prompt_from_dataset("hud-evals/SheetBench-50")
```
Fetch `system_prompt.txt` from a HuggingFace dataset repository.
**Returns:** `str | None` - System prompt text if found
**Note:** Requires `huggingface_hub` to be installed.
### save\_tasks
```python
from hud.datasets import save_tasks
# IMPORTANT: Pass dicts, not Task objects!
task_dicts = [task.model_dump() for task in tasks]
save_tasks(
tasks=task_dicts,
repo_id="my-org/my-tasks",
private=False,
dataset_name="My Tasks v1.0"
)
```
Save tasks to HuggingFace dataset with JSON string serialization.
**Parameters:**
| Parameter | Type | Description | Default |
| ---------- | ---------------------- | ------------------------------------ | -------- |
| `tasks` | `list[dict[str, Any]]` | Task dictionaries (NOT Task objects) | Required |
| `repo_id` | `str` | HuggingFace repository ID | Required |
| `**kwargs` | `Any` | Additional args for `push_to_hub()` | - |
**Important:** Always pass dictionaries to preserve environment variable templates:
```python
# ✅ GOOD - Preserves ${HUD_API_KEY} in the dataset
task_dicts = [task.model_dump() for task in tasks]
save_tasks(task_dicts, "my-dataset")
# ❌ BAD - Would expose resolved API keys!
# save_tasks(tasks, "my-dataset") # Don't do this!
```
**Dataset Schema:**
The function converts complex fields to JSON strings for clean HuggingFace schema:
* `mcp_config` → JSON string
* `setup_tool` → JSON string (if present)
* `evaluate_tool` → JSON string (if present)
* `metadata` → JSON string (if present)
## MCPToolCall Type
```python
from hud.types import MCPToolCall
```
Defines a tool invocation with arguments.
**Fields:**
| Field | Type | Description | Default |
| ----------- | ---------------- | ----------------- | -------- |
| `name` | `str` | Tool name to call | Required |
| `arguments` | `dict[str, Any]` | Tool arguments | `{}` |
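For example, the hub dispatcher pattern used throughout these docs nests one call's name and arguments inside another's `arguments`:
```python
from hud.types import MCPToolCall

# Plain tool call
setup_call = MCPToolCall(name="setup")

# Hub dispatcher call: the outer "setup" tool routes to its inner "board" function
setup_board = MCPToolCall(
    name="setup",
    arguments={"name": "board", "arguments": {"board_size": 4}},
)
```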
## Real-World Examples
### Loading Tasks from Datasets
From `examples/run_evaluation.py`:
```python
# Load dataset
dataset = load_dataset("hud-evals/SheetBench-50", split="train")
# Get system prompt
dataset_system_prompt = await fetch_system_prompt_from_dataset("hud-evals/SheetBench-50")
# Run single task
sample_task = dataset[0]
task = Task(**sample_task) # Auto-converts dict to Task
# Or run all tasks
results = await run_dataset(
"Full SheetBench Evaluation",
dataset,
ClaudeAgent,
{"model": "claude-3-5-sonnet-20241022"},
max_concurrent=50
)
```
### Task Structure in Datasets
From `environments/text_2048/2048_taskconfigs.json`:
```json
{
"id": "2048_target_256",
"prompt": "You are playing 2048 and your goal is to reach the 256 tile.",
"mcp_config": {
"local": {
"command": "docker",
"args": ["run", "--rm", "-i", "hud-text-2048"]
}
},
"setup_tool": {
"name": "setup",
"arguments": {
"name": "board",
"arguments": {"board_size": 4}
}
},
"evaluate_tool": {
"name": "evaluate",
"arguments": {
"name": "max_number",
"arguments": {"target": 256}
}
},
"metadata": {
"difficulty": 256,
"game": "2048"
}
}
```
### Creating and Saving Tasks
```python
import os
from uuid import uuid4

from hud.datasets import Task, save_tasks
from hud.types import MCPToolCall
# Set environment variables
os.environ["HUD_API_KEY"] = "sk-hud-..."
os.environ["BROWSER_PROVIDER"] = "anchorbrowser"
# Create tasks with env var templates
tasks = []
for url in ["example.com", "test.com"]:
task = Task(
id=str(uuid4()),
prompt=f"Navigate to {url} and find the login button",
mcp_config={
"browser": {
"url": "${HUD_MCP_URL:https://mcp.hud.so/v3/mcp}",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"X-Provider": "${BROWSER_PROVIDER}"
}
}
},
setup_tool=MCPToolCall(
name="setup",
arguments={"name": "navigate", "arguments": {"url": url}}
),
evaluate_tool=MCPToolCall(
name="evaluate",
arguments={"name": "element_exists", "arguments": {"selector": "button.login"}}
),
metadata={
"category": "navigation",
"difficulty": "easy"
}
)
tasks.append(task)
# Save to HuggingFace (preserves env var templates)
task_dicts = [t.model_dump() for t in tasks]
save_tasks(task_dicts, "my-org/navigation-tasks")
```
### Agent Integration
Tasks automatically configure agents:
```python
task = Task(
prompt="Complete the form",
mcp_config={...},
setup_tool={...},
evaluate_tool={...},
system_prompt="Be very careful with form fields"
)
agent = ClaudeAgent()
# When agent.run(task) is called:
# 1. If no mcp_client, creates one from task.mcp_config
# 2. Adds task.system_prompt to agent's system prompt
# 3. Adds lifecycle tools to agent.lifecycle_tools
# 4. Runs setup_tool (hidden from agent)
# 5. Executes task.prompt
# 6. Runs evaluate_tool (hidden from agent)
# 7. Returns Trace with evaluation reward
```
## Best Practices
1. **Use UUIDs for task IDs** - Required for HuggingFace datasets
2. **Save dictionaries, not objects** - Preserves env var templates
3. **Include system\_prompt.txt** - Upload to dataset repo root
4. **Use metadata for filtering** - Category, difficulty, tags
5. **Test locally first** - Before uploading to HuggingFace
6. **Version your datasets** - Use meaningful repo names
## Common Patterns
### Filtering Tasks
```python
# Load all tasks
dataset = load_dataset("hud-evals/WebArena-Lite", split="train")
# Filter in Python
login_tasks = [
task for task in dataset
if "login" in task["prompt"].lower()
]
# Convert and run
results = await run_dataset(
"Login Tasks Only",
login_tasks,
ClaudeAgent
)
```
### Custom System Prompts
```python
# Override dataset system prompt
results = await run_dataset(
"Custom Evaluation",
"hud-evals/SheetBench-50",
ClaudeAgent,
custom_system_prompt="You are an expert spreadsheet user. Be precise."
)
```
### Environment Variable Management
```python
# Tasks preserve templates
task = Task(
prompt="Test",
mcp_config={
"server": {
"api_key": "${API_KEY:default-key}"
}
}
)
# model_dump() preserves the template
data = task.model_dump()
assert data["mcp_config"]["server"]["api_key"] == "${API_KEY:default-key}"
# But task object has resolved value
os.environ["API_KEY"] = "real-key"
task2 = Task(**data)
# task2.mcp_config["server"]["api_key"] is now "real-key"
```
## See Also
* [Task System](/core-concepts/task-system) - Conceptual overview
* [Create Benchmarks](/evaluate-agents/create-benchmarks) - Building datasets
* [Agents](/reference/agents) - How agents use tasks
# Tools
Source: https://docs.hud.so/reference/tools
SDK reference for HUD tools and executors
The HUD SDK provides a comprehensive set of tools for environment interaction. All tools inherit from `BaseTool` and return standardized MCP content blocks.
## Base Classes
### BaseTool
```python
from hud.tools import BaseTool
```
Abstract base class for all MCP tools. Provides standardized output formatting and MCP registration.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------- | ----- | --------------------------------------------------- | ------------------- |
| `env` | `Any` | Optional stateful context (game, executor, browser) | `None` |
| `name` | `str` | Tool name for MCP registration | Auto from class |
| `title` | `str` | Human-readable display name | Auto from class |
| `description` | `str` | Tool description | Auto from docstring |
**Abstract Methods:**
```python
async def __call__(self, **kwargs: Any) -> list[ContentBlock]
```
**Properties:**
* `mcp` - FastMCP FunctionTool wrapper for registration
**Usage:**
```python
from hud.tools import BaseTool
from mcp.types import ContentBlock, TextContent

class MyTool(BaseTool):
async def __call__(self, param: str) -> list[ContentBlock]:
result = await self.env.do_something(param)
return [TextContent(text=result, type="text")]
# MCPServer automatically handles BaseTool instances
tool = MyTool(env=my_context)
mcp_server.add_tool(tool) # No need for .mcp property
```
### BaseHub
```python
from hud.tools import BaseHub
```
A composition-friendly FastMCP server for organizing tools into nested namespaces.
**Key Features:**
* Internal tool dispatcher pattern
* Automatic resource catalog generation
* Hidden internal tools from MCP clients
**Usage:**
```python
hub = BaseHub("evaluators")
@hub.tool() # Internal tool, hidden from agents
async def url_match(url: str) -> EvaluationResult:
return EvaluationResult(reward=1.0, done=True)
# Access via dispatcher when mounted on server
# Agents call: evaluators(name="url_match", arguments={"url": "..."})
```
## Core Tools
### BashTool
```python
from hud.tools import BashTool
```
Execute bash commands in a persistent shell session.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------- | -------------- | --------------------------- | ------- |
| `session` | `_BashSession` | Pre-configured bash session | `None` |
**Tool Parameters:**
| Parameter | Type | Description | Default |
| --------- | ------ | ------------------------ | -------- |
| `command` | `str` | Bash command to execute | Required |
| `restart` | `bool` | Restart the bash session | `False` |
**Example:**
```python
bash = BashTool()
result = await bash(command="ls -la")
# Returns: [TextContent(text="file1.txt\nfile2.txt", type="text")]
# Restart session
await bash(restart=True)
```
### EditTool
```python
from hud.tools import EditTool
```
File system editor with undo support.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| -------------- | ----------------------- | --------------------- | ------- |
| `file_history` | `dict[Path, list[str]]` | Edit history per file | `None` |
**Commands:**
| Command | Parameters | Description |
| ------------- | -------------------------------- | ------------------------------- |
| `view` | `path`, `view_range` | View file or directory contents |
| `create` | `path`, `file_text` | Create new file |
| `str_replace` | `path`, `old_str`, `new_str` | Replace string in file |
| `insert` | `path`, `insert_line`, `new_str` | Insert text at line |
| `undo_edit` | `path` | Undo last edit |
**Example:**
```python
editor = EditTool()
# View file
await editor(command="view", path="/path/to/file.py", view_range=[1, 20])
# Create file
await editor(command="create", path="/new/file.py", file_text="print('hello')")
# Replace text
await editor(command="str_replace", path="/file.py",
old_str="old_text", new_str="new_text")
```
### PlaywrightTool
```python
from hud.tools import PlaywrightTool
```
Web automation using Playwright browser.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------- | ------ | ---------------------------- | ------- |
| `page` | `Page` | Existing Playwright page | `None` |
| `cdp_url` | `str` | Chrome DevTools Protocol URL | `None` |
**Actions:**
| Action | Parameters | Description |
| ------------------ | ---------------------------- | -------------------------- |
| `navigate` | `url`, `wait_for_load_state` | Navigate to URL |
| `screenshot` | - | Capture page screenshot |
| `click` | `selector` | Click element |
| `type` | `selector`, `text` | Type text in element |
| `get_page_info` | - | Get page title and URL |
| `wait_for_element` | `selector` | Wait for element to appear |
**Example:**
```python
tool = PlaywrightTool()
# Navigate
await tool(action="navigate", url="https://example.com")
# Click button
await tool(action="click", selector="button.submit")
# Type text
await tool(action="type", selector="input#search", text="query")
```
## Computer Control Tools
### HudComputerTool
```python
from hud.tools import HudComputerTool
```
Universal computer control with automatic scaling and executor selection.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ---------------- | -------------------------------- | ------------------- | ----------- |
| `executor` | `BaseExecutor` | Executor to use | Auto-detect |
| `platform_type` | `"auto"`, `"xdo"`, `"pyautogui"` | Executor type | `"auto"` |
| `display_num` | `int` | X display number | From env |
| `width` | `int` | Agent screen width | 1280 |
| `height` | `int` | Agent screen height | 720 |
| `rescale_images` | `bool` | Rescale screenshots | `True` |
**Actions:**
| Action | Parameters | Description |
| ------------ | ------------------------------------------ | --------------------- |
| `screenshot` | - | Capture screen |
| `click` | `x`, `y`, `button`, `pattern`, `hold_keys` | Mouse click |
| `write` | `text`, `enter_after`, `delay` | Type text |
| `press` | `keys` | Press key combination |
| `scroll` | `x`, `y`, `scroll_x`, `scroll_y` | Scroll |
| `drag` | `start_x`, `start_y`, `end_x`, `end_y` | Drag mouse |
**Example:**
```python
computer = HudComputerTool()
# Take screenshot
await computer(action="screenshot")
# Click at coordinates
await computer(action="click", x=100, y=200)
# Type text
await computer(action="write", text="Hello World", enter_after=True)
# Press hotkey
await computer(action="press", keys=["ctrl", "c"])
```
### AnthropicComputerTool
```python
from hud.tools import AnthropicComputerTool
```
Computer control optimized for Anthropic's Claude models.
**Features:**
* Pre-configured for 1280x720 resolution
* Optimized action names for Claude
* Built-in screenshot scaling
### OpenAIComputerTool
```python
from hud.tools import OpenAIComputerTool
```
Computer control optimized for OpenAI models.
**Features:**
* Pre-configured for 1920x1080 resolution
* Simplified action interface
* No automatic screenshot scaling
## Executors
Executors provide platform-specific implementations for computer control actions.
### BaseExecutor
```python
from hud.tools.executors import BaseExecutor
```
Abstract base providing simulation mode for all actions.
**Core Methods:**
* `click(x, y, button, pattern, hold_keys)` - Mouse click
* `write(text, enter_after, delay)` - Type text
* `press(keys)` - Press key combination
* `scroll(x, y, scroll_x, scroll_y)` - Scroll
* `drag(start_x, start_y, end_x, end_y)` - Drag mouse
* `screenshot()` - Capture screen
* `get_screen_size()` - Get display dimensions
### PyAutoGUIExecutor
```python
from hud.tools.executors import PyAutoGUIExecutor
```
Cross-platform executor using PyAutoGUI library.
**Features:**
* Works on Windows, macOS, Linux
* Real mouse/keyboard control
* Screenshot capture
* Automatic failsafe
**Example:**
```python
executor = PyAutoGUIExecutor()
computer = HudComputerTool(executor=executor)
```
### XDOExecutor
```python
from hud.tools.executors import XDOExecutor
```
Linux/X11 executor using xdotool.
**Features:**
* Native X11 integration
* Faster than PyAutoGUI on Linux
* Support for X display selection
* Window management capabilities
**Example:**
```python
executor = XDOExecutor(display_num=1)
computer = HudComputerTool(executor=executor)
```
## Common Types
### ContentBlock
MCP standard output format (from `mcp.types`):
```python
from mcp.types import TextContent, ImageContent
# Text output
TextContent(text="Operation complete", type="text")
# Image output
ImageContent(data="base64_data", mimeType="image/png", type="image")
```
### EvaluationResult
```python
from hud.tools.types import EvaluationResult
result = EvaluationResult(
reward=0.8, # Score 0-1
done=True, # Task complete
content="Details", # Optional text
info={"score": 80} # Metadata
)
```
### ContentResult
```python
from hud.tools.types import ContentResult
# Helper for building complex outputs
result = ContentResult(
output="Success message",
error="Error if any",
base64_image="screenshot_data",
system="System message"
)
# Convert to ContentBlocks
blocks = result.to_content_blocks()
```
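A small sketch of a tool body that bundles text and a screenshot with `ContentResult` (the `env` helpers here are hypothetical):
```python
from hud.tools.types import ContentResult
from mcp.types import ContentBlock

async def act_and_report(env) -> list[ContentBlock]:
    # Hypothetical environment helpers; substitute your own
    message = await env.perform_action()
    screenshot_b64 = await env.screenshot()

    result = ContentResult(output=message, base64_image=screenshot_b64)
    return result.to_content_blocks()  # ready to return from a tool's __call__
```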
## Integration Examples
### Adding Tools to Environment
```python
from hud.server import MCPServer
from hud.tools import BashTool, EditTool
mcp = MCPServer(name="my-env")
# MCPServer handles BaseTool instances automatically
bash = BashTool()
mcp.add_tool(bash) # Internally uses bash.mcp
editor = EditTool()
mcp.add_tool(editor) # Same here
```
### Custom Tool Implementation
```python
from hud.tools import BaseTool
from mcp.types import ContentBlock, TextContent
class DatabaseTool(BaseTool):
def __init__(self, db_connection):
super().__init__(
env=db_connection,
name="database",
title="Database Query Tool",
description="Execute SQL queries"
)
async def __call__(self, query: str) -> list[ContentBlock]:
try:
results = await self.env.execute(query)
return [TextContent(text=str(results), type="text")]
except Exception as e:
return [TextContent(text=f"Error: {e}", type="text")]
```
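Registering the custom tool works the same way as the built-in tools above; `db_connection` is assumed to be an async client exposing the `execute(query)` method used in the class:
```python
from hud.server import MCPServer

mcp = MCPServer(name="db-env")

database = DatabaseTool(db_connection)
mcp.add_tool(database)  # MCPServer wires up the BaseTool instance automatically
```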
### Hub Pattern for Evaluators
```python
from hud.tools import BaseHub
from hud.tools.types import EvaluationResult
evaluators = BaseHub("evaluate")
@evaluators.tool("text_contains")
async def check_text(text: str, target: str) -> EvaluationResult:
return EvaluationResult(
reward=1.0 if target in text else 0.0,
done=True,
content=f"Checking if '{target}' in text"
)
# Use in environment
@mcp.tool()
async def evaluate(name: str, **kwargs):
return await evaluators.call_tool(name, kwargs)
```
## See Also
* [Build Environments](/build-environments) - Using tools in environments
* [Architecture](/core-concepts/architecture) - How tools fit in HUD
* [Adapting Software](/build-environments/adapting-software) - Tool patterns
# Datasets
Source: https://docs.hud.so/train-agents/datasets
Create and manage training datasets
Training datasets are collections of HUD tasks stored on HuggingFace. Each task defines a prompt, MCP configuration, and evaluation criteria.
## Creating Datasets
```python
from hud.datasets import Task, save_tasks
from hud.types import MCPToolCall
# Create task variations
tasks = []
for i in range(100):
tasks.append(Task(
prompt=f"Reach the {2**(9+i%4)} tile in 2048",
mcp_config={"hudpython/hud-text-2048:v1.2": {}},
setup_tool=MCPToolCall(name="setup_board"),
evaluate_tool=MCPToolCall(
name="evaluate_max_tile",
arguments={"target": 2**(9+i%4)}
)
))
# Save to HuggingFace
save_tasks(tasks, "my-org/2048-training")
```
## Dataset Format
Tasks are stored as HuggingFace datasets with these fields:
```python
{
"id": "task-001",
"prompt": "Navigate to login page",
"mcp_config": '{"hudpython/hud-browser:latest": {}}', # JSON string
"setup_tool": '{"name": "clear_cookies"}', # JSON string
"evaluate_tool": '{"name": "check_logged_in"}', # JSON string
"metadata": '{"difficulty": "easy"}' # JSON string
}
```
## Loading Datasets
```python
from datasets import load_dataset
from hud.datasets import Task
# Load from HuggingFace
dataset = load_dataset("hud-evals/2048-taskset", split="train")
tasks = [Task(**task_dict) for task_dict in dataset]
# Filter tasks
easy_tasks = [t for t in tasks if t.metadata.get("difficulty") == "easy"]
# Use in training (Verifiers integration)
import verifiers as vf

env = vf.load_environment(
env_id="hud-vf-gym",
taskset="hud-evals/2048-taskset",
config_path="./configs/2048.yaml",
num_tasks=50 # Sample 50 tasks
)
```
## Curriculum Learning
Create staged datasets for progressive training:
```python
# Stage 1: Simple moves
save_tasks(create_simple_tasks(), "my-org/2048-stage1")
# Stage 2: Strategic play
save_tasks(create_medium_tasks(), "my-org/2048-stage2")
# Stage 3: Expert level
save_tasks(create_hard_tasks(), "my-org/2048-stage3")
```
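The `create_*_tasks()` helpers above are placeholders; a hedged sketch of what stage 1 might look like, reusing the fields from the dataset example above:
```python
from hud.datasets import Task
from hud.types import MCPToolCall

def create_simple_tasks(n: int = 100) -> list[Task]:
    """Stage 1: low tile targets so an untrained policy can still earn reward."""
    return [
        Task(
            prompt="Reach the 64 tile in 2048",
            mcp_config={"hudpython/hud-text-2048:v1.2": {}},
            evaluate_tool=MCPToolCall(name="evaluate_max_tile", arguments={"target": 64}),
        )
        for _ in range(n)
    ]
```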
## Best Practices
* **Diversity**: Include varied prompts and scenarios
* **Balance**: Mix difficulty levels appropriately
* **Size**: Typically 100-1000 tasks per dataset
* **Validation**: Keep 10-20% of tasks for evaluation (see the split sketch below)
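For the validation point above, a minimal split sketch using the `tasks` list and `save_tasks` from the earlier example (dataset names are illustrative):
```python
import random

random.seed(0)
random.shuffle(tasks)

split = int(len(tasks) * 0.85)  # hold out ~15% for evaluation
save_tasks(tasks[:split], "my-org/2048-train")
save_tasks(tasks[split:], "my-org/2048-eval")
```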
## Example Datasets
* [hud-evals/2048-taskset](https://huggingface.co/datasets/hud-evals/2048-taskset) - 2048 game variations
* [hud-evals/sheetbench-50](https://huggingface.co/datasets/hud-evals/sheetbench-50) - Spreadsheet tasks
# Train Agents
Source: https://docs.hud.so/train-agents/index
Improve agent performance with reinforcement learning
Train agents to perform better on specific tasks using reinforcement learning. Small models trained on your tasks can outperform larger general models.
## Overview
HUD integrates with [Verifiers](https://github.com/willccbb/verifiers) for GRPO (Group Relative Policy Optimization) training:
```python
import verifiers as vf
# Load environment
env = vf.load_environment(
env_id="hud-vf-gym",
taskset="hud-evals/2048-taskset",
config_path="configs/2048.yaml"
)
# Train with GRPO
model, tokenizer = vf.get_model_and_tokenizer("Qwen/Qwen2.5-3B-Instruct")
trainer = vf.GRPOTrainer(
model=model,
env=env,
args=vf.grpo_defaults(),
peft_config=vf.lora_defaults() # LoRA for efficiency
)
trainer.train()
```
## Why Train?
* **Task Performance**: A 3B model can reach 80% success vs. a 35% baseline
* **Cost**: Smaller models are cheaper to run
* **Speed**: Trained models complete tasks faster
* **Consistency**: Less variance in behavior
## Requirements
| Model Size | GPU Memory         | Training Time (100 steps) |
| ---------- | ------------------ | ------------------------- |
| 3B         | 24GB (1x RTX 4090) | 30 minutes                |
| 7B         | 40GB (1x A100)     | 1 hour                    |
| 70B        | 320GB (4x A100)    | 6+ hours                  |
## Monitoring & Analytics
Track training progress in real-time:
* **[app.hud.so](https://app.hud.so)** - View traces, metrics, and rewards for each training generation
* **Weights & Biases** - Detailed ML metrics and loss curves
* **Local logs** - Training checkpoints and evaluation results
## Getting Started
* [RL Quickstart](/train-agents/rl-quickstart) - Train your first agent in 30 minutes
* [Config System](/train-agents/verifiers-gym) - Configure tools with YAML
* [Datasets](/train-agents/datasets) - Create training datasets
# RL Quickstart
Source: https://docs.hud.so/train-agents/rl-quickstart
Train agents with reinforcement learning using GRPO
Train agents to perform better on specific tasks using GRPO (Group Relative Policy Optimization).
## Quick Start
### Prerequisites
* GPU with 24GB+ VRAM (A100, RTX 4090)
* Python 3.12+
* HUD API key (set `HUD_API_KEY` environment variable)
### Installation
```bash
# Clone repository
git clone https://github.com/hud-evals/hud-python
cd hud-python/rl
# Clone hud-vf-gym (required)
git clone https://github.com/hud-evals/hud-vf-gym.git
# Install with uv (recommended)
uv sync
source .venv/bin/activate
```
### Train 2048 Agent
```bash
# Set API key
export HUD_API_KEY="YOUR_API_KEY"
# Build the 2048 environment
cd ../environments/text_2048/
docker build -t hud-text-2048 .
cd ../../rl/
```
```bash
# Terminal 1 - GPU 0
CUDA_VISIBLE_DEVICES=0 vf-vllm \
--model Qwen/Qwen2.5-3B-Instruct \
--enforce-eager \
--disable-log-requests
```
```bash
# Terminal 2 - GPU 1
CUDA_VISIBLE_DEVICES=1 python train_2048.py
```
The included `train_2048.py` script trains a 3B model on the 2048 game. Training takes \~30 minutes for 100 steps.

*Qwen-2.5-3B agent training on the [text-2048](https://github.com/hud-evals/hud-python/tree/main/environments/text_2048) environment using GRPO*
## How It Works
GRPO trains by comparing multiple attempts on the same task:
```mermaid
graph LR
A[Task] --> B[Generate 16 Attempts]
B --> C[Rank by Reward]
C --> D[Learn from Best]
D --> E[Improved Model]
```
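Conceptually, each attempt's reward is compared to the other attempts on the same task, and the policy is pushed toward above-average trajectories. A library-agnostic sketch of that group-relative signal (not the Verifiers implementation):
```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Score each attempt relative to its group: above-average attempts get a
    positive advantage, below-average attempts a negative one."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mean) / std for r in rewards]

# e.g. rewards from 4 of the 16 attempts at the same 2048 task
advantages = group_relative_advantages([0.1, 0.4, 0.9, 0.2])
```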
## Configuration
Training uses YAML configs to map agent tools to MCP tools:
```yaml
# From rl/configs/2048.yaml
system_prompt: |
You are an excellent 2048 player.
Available moves: left(), right(), up(), down(), done()
Return ONLY: move()
action_mappings:
left:
_tool: "move"
direction: {static: "left"}
right:
_tool: "move"
direction: {static: "right"}
# ... up, down similar
```
## Custom Training
```python
import verifiers as vf
# Load environment
env = vf.load_environment(
env_id="hud-vf-gym",
taskset="your-org/your-taskset", # HuggingFace dataset
config_path="configs/your-env.yaml", # In hud-vf-gym
num_tasks=10
)
# Configure training (matching train_2048.py)
model, tokenizer = vf.get_model_and_tokenizer("Qwen/Qwen2.5-3B-Instruct")
args = vf.grpo_defaults(run_name="my-training")
args.gradient_accumulation_steps = 2
args.per_device_train_batch_size = 8
args.num_generations = 16
args.max_steps = 100
args.learning_rate = 1e-6
# Train with LoRA
trainer = vf.GRPOTrainer(
model=model,
processing_class=tokenizer,
env=env,
args=args,
peft_config=vf.lora_defaults()
)
trainer.train()
```
## Monitoring
### HUD Dashboard
View training progress and traces at [app.hud.so](https://app.hud.so):
* Real-time screenshot and step viewing
* Tool call traces for each trajectory
* Reward distribution across generations
### Weights & Biases
For detailed ML metrics:
```python
args.report_to = "wandb"
os.environ["WANDB_API_KEY"] = "your-key"
```
## Resources
* [Example Training Script](https://github.com/hud-evals/hud-python/blob/main/rl/train_2048.py)
* [HUD VF Gym](https://github.com/hud-evals/hud-vf-gym) - Verifiers integration
# Config System
Source: https://docs.hud.so/train-agents/verifiers-gym
YAML-based tool configuration for training
Verifiers Gym uses YAML configs to map agent-facing tools to MCP tools, enabling flexible environment configuration without code changes.
## Basic Configuration
```yaml configs/browser.yaml
# Define agent interface
system_prompt: |
You are an AI assistant that can interact with a browser.
Available tools:
- screenshot(): Take a screenshot
- click(x, y): Click at coordinates
- type(text): Type text
- done(): Complete the task
Use function(args) format.
# Map agent tools to MCP tools
action_mappings:
screenshot:
_tool: "computer"
action: "screenshot"
click:
_parser:
positional: ["x", "y"] # Expected arguments
_tool: "computer"
action: "click"
x: {from_arg: "x"}
y: {from_arg: "y"}
type:
_parser:
positional: ["text"]
_tool: "computer"
action: "type"
text: {from_arg: "text"}
```
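Under this config, the agent emits plain text such as `click(100, 200)`; the parser pulls out the positional arguments and the mapping rewrites them into an MCP tool call. A hedged illustration of the resulting call:
```python
# Agent output (text):  click(100, 200)
# Resulting MCP tool call, conceptually:
mcp_call = {
    "name": "computer",       # from _tool
    "arguments": {
        "action": "click",    # static value in the mapping
        "x": 100,             # from_arg: "x"
        "y": 200,             # from_arg: "y"
    },
}
```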
## Advanced Features
### Argument Transformation
Transform arguments before passing to MCP:
```yaml
scroll:
_parser:
positional: ["direction", "amount"]
_tool: "computer"
action: "scroll"
coordinate: [640, 400] # Static value
direction:
from_arg: "direction"
amount:
from_arg: "amount"
transform: "lambda a: int(a * 120)" # Convert to pixels
```
### Context-Aware Transforms
Access other arguments in transforms:
```yaml
key:
_parser:
positional: ["key"]
_tool: "computer"
action: "press"
keys:
from_arg: "key"
transform: "lambda x: x.split('+')" # "ctrl+a" → ["ctrl", "a"]
```
### Static Values
Use fixed values instead of arguments:
```yaml
# 2048 game directions
left:
_tool: "move"
direction: {static: "left"}
right:
_tool: "move"
direction: {static: "right"}
```
## Parser Settings
Control agent output parsing:
```yaml
parser:
use_thinking: false # Disable tags
xml_weight: 0.6 # XML format validation weight
action_weight: 0.4 # Action syntax validation weight
rubric:
weights:
task_completion: 0.8 # Primary task reward
tool_execution: 0.1 # Tool success rate
format_compliance: 0.1 # Proper formatting
```
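The rubric weights combine per-component scores into the scalar reward used for training; a small sketch of that weighting (component scores are illustrative):
```python
# Component scores in [0, 1]; names match the rubric above
scores = {"task_completion": 1.0, "tool_execution": 0.8, "format_compliance": 1.0}
weights = {"task_completion": 0.8, "tool_execution": 0.1, "format_compliance": 0.1}

reward = sum(weights[k] * scores[k] for k in weights)  # 0.8 + 0.08 + 0.1 = 0.98
```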
## Usage
```python
import verifiers as vf
# Use your config (path relative to where you run the script)
env = vf.load_environment(
env_id="hud-vf-gym",
taskset="your-org/your-taskset",
config_path="configs/your-env.yaml"
)
```
## Examples
Config files are located in `hud-python/rl/configs/`:
* `2048.yaml` - 2048 game with directional moves
* `default.yaml` - Browser/computer interaction