# Adapting Software Source: https://docs.hud.so/build-environments/docker-to-mcp Patterns for wrapping existing applications using HUD SDK tools Learn how to adapt existing software into MCP environments using patterns from the HUD SDK. All examples link to actual implementations in [hud-evals/hud-python](https://github.com/hud-evals/hud-python). ## Browser Automation Pattern The HUD SDK provides powerful computer tools that can control any browser-based application. See how [remote\_browser](https://github.com/hud-evals/hud-python/tree/main/environments/remote_browser) implements this pattern. ### Using HUD Computer Tools ```python from hud.tools.computer import HudComputerTool from hud.tools.executors.base import BaseExecutor # Browser executor wraps Playwright class BrowserExecutor(BaseExecutor): def __init__(self, playwright_tool): self.browser = playwright_tool async def screenshot(self) -> str: # Return base64 screenshot from browser return await self.browser.screenshot() async def click(self, x: int, y: int): # Translate to browser coordinates await self.browser.click(x, y) # Use with computer tool executor = BrowserExecutor(playwright_tool) computer_tool = HudComputerTool(executor=executor) ``` See the full implementation: * [BrowserExecutor](https://github.com/hud-evals/hud-python/blob/main/environments/remote_browser/src/hud_controller/tools/executor.py) - Wraps Playwright with computer tool interface * [Server integration](https://github.com/hud-evals/hud-python/blob/main/environments/remote_browser/src/hud_controller/server.py#L202-L205) - How tools are registered ### Multiple Provider Support ```python # Support different cloud browsers provider = os.getenv("BROWSER_PROVIDER", "anchorbrowser") browser_provider = get_provider(provider) # Each provider implements the same interface cdp_url = await browser_provider.launch() browser = await playwright.connect_over_cdp(cdp_url) ``` See [providers directory](https://github.com/hud-evals/hud-python/tree/main/environments/remote_browser/src/hud_controller/providers) for implementations. ## Game Environment Pattern Simple stateful applications like games use custom tools. See [text\_2048](https://github.com/hud-evals/hud-python/tree/main/environments/text_2048) for the pattern. ### Custom Tool Implementation ```python from hud.tools.base import BaseTool from mcp.types import ContentBlock, TextContent class MoveTool(BaseTool): """Custom tool for game moves.""" def __init__(self, env): super().__init__( name="move", description="Move tiles in a direction", parameters={ "type": "object", "properties": { "direction": {"type": "string", "enum": ["up", "down", "left", "right"]} }, "required": ["direction"] } ) self.env = env async def __call__(self, direction: str) -> list[ContentBlock]: """Execute the move - this is what BaseTool calls.""" # Perform the move self.env.move(direction) # Return visual state board_display = self.env.get_board_ascii() return [TextContent(text=board_display, type="text")] ``` See [MoveTool implementation](https://github.com/hud-evals/hud-python/blob/main/environments/text_2048/src/hud_controller/tools/move.py). 
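The `env` object handed to `MoveTool` only needs to expose the two methods the tool calls: `move()` and `get_board_ascii()`. Here is a minimal sketch of such a state object — `Game2048Env` and its internals are illustrative assumptions, not the actual `text_2048` implementation:

```python
import random

class Game2048Env:
    """Illustrative stand-in for the game state MoveTool expects."""

    def __init__(self, size: int = 4):
        self.size = size
        self.board = [[0] * size for _ in range(size)]
        self._spawn_tile()

    def _spawn_tile(self) -> None:
        # Place a 2 in a random empty cell
        empty = [(r, c) for r in range(self.size)
                 for c in range(self.size) if self.board[r][c] == 0]
        if empty:
            r, c = random.choice(empty)
            self.board[r][c] = 2

    def move(self, direction: str) -> None:
        # A real implementation would slide and merge tiles here;
        # this sketch only spawns a new tile after each move
        assert direction in ("up", "down", "left", "right")
        self._spawn_tile()

    def get_board_ascii(self) -> str:
        # Render the grid as fixed-width text for the agent to read
        return "\n".join(
            " ".join(f"{cell or '.':>4}" for cell in row)
            for row in self.board
        )
```

An instance of this class is what gets passed as `env` when constructing `MoveTool(env)`.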
## Docker Base Images

HUD provides optimized base images for common use cases:

### Browser Automation

```dockerfile
FROM hudpython/browser-base:latest
# Includes Chromium, Playwright, and dependencies

# Add your automation code
COPY src/ /app/src/
CMD ["python", "-m", "hud_controller.server"]
```

### Desktop GUI

```dockerfile
FROM hudpython/novnc-base:latest
# Ubuntu with XFCE, VNC, and noVNC pre-configured

# Install your GUI application
RUN apt-get update && apt-get install -y \
    your-gui-app

# VNC is available on port 5901, noVNC on 8080
```

### Custom Base Images

When building from scratch:

```dockerfile
FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    required-packages \
    && rm -rf /var/lib/apt/lists/*

# Python environment
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1

WORKDIR /app
```

See [base image examples](https://github.com/hud-evals/hud-python/tree/main/environments) in environment Dockerfiles.

## Environment Context Pattern

Share state between tools using environment context:

```python
# Define context class
class BrowserEnvironmentContext:
    def __init__(self, browser):
        self.browser = browser
        self.page = None

    async def goto(self, url: str):
        self.page = await self.browser.new_page()
        await self.page.goto(url)

# Share with tools in server.py
@mcp.initialize
async def initialize_environment(ctx):
    browser = await launch_browser()
    env = BrowserEnvironmentContext(browser)

    # Make env available to all tools
    global environment
    environment = env
```

## Common Return Types

All tools should return standardized types for consistency.

### ContentBlock for Tool Output

```python
from mcp.types import ContentBlock, TextContent, ImageContent

# Text output
return [TextContent(text="Operation complete", type="text")]

# Image output
return [ImageContent(data=base64_data, mimeType="image/png", type="image")]

# Multiple outputs
return [
    TextContent(text="Screenshot captured", type="text"),
    ImageContent(data=screenshot, mimeType="image/png", type="image")
]
```

### EvaluationResult for Scoring

```python
from hud.tools.types import EvaluationResult

return EvaluationResult(
    reward=0.8,   # Score between 0-1
    done=True,    # Task complete?
    info={        # Additional context
        "score": 80,
        "threshold": 75,
        "details": "Form submitted successfully"
    }
)
```

See [types.py](https://github.com/hud-evals/hud-python/blob/main/hud/tools/types.py) for full definitions.

## Tool Discovery and Resources

Expose available functions as MCP resources:

```python
@mcp.resource("setup://functions")
async def list_setup_functions():
    """List all available setup functions."""
    return {
        "functions": [
            {"name": name, "description": func.__doc__}
            for name, func in setup_functions.items()
        ]
    }
```

See [resource examples](https://github.com/hud-evals/hud-python/blob/main/environments/browser/src/hud_controller/server.py#L67-L90).

## Error Handling Patterns

Consistent error handling across tools:

```python
from mcp import McpError
from mcp.types import INTERNAL_ERROR, ErrorData

try:
    result = await risky_operation()
except SpecificError as e:
    # Recoverable errors: return them as content the agent can react to
    return [TextContent(text=f"Operation failed: {e}", type="text")]
except Exception as e:
    # Fatal errors: McpError wraps an ErrorData payload
    raise McpError(ErrorData(code=INTERNAL_ERROR, message=f"Unexpected error: {e}"))
```

## Next Steps

* Study complete implementations
* Learn hub pattern, telemetry, and metadata

For a step-by-step implementation guide, see the [environments README](https://github.com/hud-evals/hud-python/blob/main/environments/README.md).
# Build Environments Source: https://docs.hud.so/build-environments/index Wrap any software in MCP for agent evaluation Turn any software that runs in Docker into an agent-testable environment. Whether it's a web app, desktop application, game, or API, HUD provides the tools to make it MCP-compatible. ## What Defines a HUD Environment? A HUD environment is a Docker container that: 1. **Runs an MCP server** via stdio transport 2. **Implements lifecycle tools** (`setup` and `evaluate`) 3. **Exposes interaction tools** for agents to use 4. **Works identically** locally and remotely The key is the Docker `CMD` entrypoint - it launches an MCP server that can be connected to by any MCP client, enabling both local development and remote scaling. ## Architecture & Lifecycle ```mermaid graph LR Agent[Agent] -->|MCP Protocol| Server[MCP Server] Server --> Setup[Setup Tool] Server --> Interact[Interaction Tools] Server --> Evaluate[Evaluate Tool] Setup -->|Initialize State| Environment Interact -->|Control| Environment Evaluate -->|Score Performance| Environment ``` ### Lifecycle Flow 1. **Initialize**: MCP handshake establishes connection 2. **Setup**: Prepare environment to known state (hidden from agent) 3. **Interact**: Agent uses tools to complete task 4. **Evaluate**: Score agent's performance (hidden from agent) ## Environment Specification ### Required Components Every HUD environment must have: ```python from hud.server import MCPServer mcp = MCPServer(name="my-environment") # 1. Setup tool (lifecycle - hidden from agents) @mcp.tool() async def setup(**kwargs) -> dict: """Initialize environment state""" # Reset to known state # Load test data # Configure settings return {"status": "ready"} # 2. Evaluate tool (lifecycle - hidden from agents) @mcp.tool() async def evaluate(**kwargs) -> dict: """Score agent performance""" # Check task completion # Calculate reward (0-1) # Determine if done return { "reward": 0.8, "done": True, "info": {"details": "..."} } # 3. Interaction tools (visible to agents) @mcp.tool() async def perform_action(param: str) -> dict: """Tool agents use to interact""" # Modify environment state # Return feedback return {"result": "success"} ``` ### Standard File Structure ``` my-environment/ ├── Dockerfile # Container definition ├── pyproject.toml # Python package config ├── README.md # Environment documentation └── src/ └── hud_controller/ # Standard package name ├── __init__.py ├── server.py # MCP server entry point ├── context.py # Environment state (optional) ├── setup/ # Setup functions (if using hub) ├── evaluate/ # Evaluation functions (if using hub) └── tools/ # Custom interaction tools ``` ### Docker Requirements The Dockerfile must: * Install dependencies * Copy source code * Set proper working directory * **Define CMD that runs the MCP server** ```dockerfile FROM python:3.11-slim WORKDIR /app # Copy and install COPY pyproject.toml ./ COPY src/ ./src/ RUN pip install -e . # Critical: Launch MCP server on stdio CMD ["python", "-m", "hud_controller.server"] ``` This `CMD` is what enables the same image to work both locally and remotely: * Locally: `docker run -i` pipes stdio * Remotely: HUD platform manages stdio transport ## MCPServer vs FastMCP The HUD SDK provides `MCPServer`, which extends FastMCP with Docker-friendly features: ### What MCPServer Adds 1. **SIGTERM Handling**: Graceful shutdown in containers ```python mcp = MCPServer(name="my-env") mcp.run() # Automatically handles SIGTERM ``` 2. 
**Initialize Decorator**: Async setup during MCP handshake ```python @mcp.initialize async def initialize_environment(ctx): # ctx.session: MCP ServerSession # ctx.meta: Metadata from client # ctx.meta.progressToken: For progress updates await setup_resources() ``` 3. **Shutdown Decorator**: Cleanup after lifespan ```python @mcp.shutdown async def cleanup(): await close_connections() ``` 4. **BaseTool Support**: Enhanced tool registration ```python from hud.tools.base import BaseTool custom_tool = MyTool() # Inherits from BaseTool mcp.add_tool(custom_tool) # Uses tool.mcp automatically ``` ## Common MCP Types ### Tool Outputs All tools should return standardized types: ```python from mcp.types import ContentBlock, TextContent, ImageContent # Text output return [TextContent(text="Operation complete", type="text")] # Image output return [ImageContent( data=base64_screenshot, mimeType="image/png", type="image" )] # Multiple outputs return [ TextContent(text="Status: Ready", type="text"), ImageContent(data=screenshot, mimeType="image/png", type="image") ] ``` ### Evaluation Results ```python from hud.tools.types import EvaluationResult return EvaluationResult( reward=0.8, # Score between 0-1 done=True, # Task complete? content="Details", # Optional description info={ # Additional metadata "score": 80, "threshold": 75 } ) ``` ## Quick Example Here's a minimal environment: ```python from hud.server import MCPServer from mcp.types import TextContent mcp = MCPServer(name="counter-env") # State counter = 0 @mcp.tool() async def setup(start_value: int = 0): """Reset counter to starting value""" global counter counter = start_value return {"status": "ready", "counter": counter} @mcp.tool() async def increment(): """Increment the counter""" global counter counter += 1 return [TextContent(text=f"Counter: {counter}", type="text")] @mcp.tool() async def evaluate(target: int): """Check if counter reached target""" global counter return { "reward": 1.0 if counter >= target else counter / target, "done": counter >= target, "info": {"counter": counter, "target": target} } if __name__ == "__main__": mcp.run() ``` ## Getting Started Build your first environment step-by-step Patterns for wrapping existing applications Hub pattern, telemetry, and production tips ## Reference Implementations Study these examples: ### text\_2048 (Simple) * ASCII game with clear state * Custom move tool * Simple evaluation logic ### remote\_browser (Complex) * Multi-provider browser automation * Hub pattern for dozens of functions * Production telemetry ## Key Concepts ### Tool Categories 1. **Setup Tools**: Initialize state (hidden from agents) 2. **Evaluate Tools**: Score performance (hidden from agents) 3. **Interaction Tools**: What agents use to control the environment ### MCP Communication * **stdout**: Reserved for MCP protocol (JSON-RPC) * **stderr**: All logs and debug output * **stdio transport**: How containers communicate ### Process Separation for Hot-Reload Development **Critical Design Principle**: For environments that need to maintain state during development (like browsers, databases, or running applications), you must separate the MCP server process from the actual environment process. This separation enables hot-reload development without losing state. The MCP server should be a lightweight process that: 1. Starts quickly and can restart without side effects 2. Connects to your environment process on initialization 3. Acts as a bridge between MCP protocol and your environment The environment process should: 1. 
Run as a separate, persistent service 2. Maintain state across MCP server restarts 3. Accept connections from the MCP server #### Communication Methods You can connect the MCP server to your environment using: * **Unix Sockets** (recommended for local development) ```python # In your MCP server from hud.server.context import attach_context ctx = attach_context("/tmp/my_env.sock") ``` * **TCP/HTTP API** ```python # Connect to HTTP API async with httpx.AsyncClient() as client: response = await client.get("http://localhost:8080/state") ``` * **gRPC** ```python # Connect via gRPC channel = grpc.aio.insecure_channel('localhost:50051') stub = MyServiceStub(channel) ``` * **Shared Memory/IPC** ```python # Using multiprocessing managers from multiprocessing.managers import BaseManager manager.connect() ``` #### Example: Browser Environment Pattern The browser environment demonstrates this pattern: 1. **Context Server** (`context_server.py`): Persistent process managing browser state 2. **MCP Server** (`server.py`): Lightweight server that restarts on code changes 3. **Connection**: Unix socket at `/tmp/hud_browser_ctx.sock` ```python # context_server.py - Runs once, maintains state class PersistentBrowserContext: def __init__(self): self.browser = None self.apps = {} def startup(self): # Initialize browser, X11, VNC once self.browser = launch_browser() # server.py - Can restart anytime @mcp.initialize async def initialize_environment(ctx): # Connect to persistent context persistent_ctx = attach_context("/tmp/hud_browser_ctx.sock") # Use existing browser state browser = persistent_ctx.get_browser() ``` This separation allows `hud dev` to restart the MCP server for code changes while keeping your browser, database, or application running. ### Testing Your Environment #### Traditional Testing ```bash # Debug all phases hud debug my-environment:latest # Analyze available tools hud analyze my-environment:latest ``` #### Hot-Reload Development ```bash # Start development server with live reload hud dev --build # Edit src/ files - changes are immediately reflected # Test in another terminal hud analyze hud-my-environment:dev ``` The `hud dev` command: * Mounts your `src/` directory into containers * Auto-detects or builds Docker images * Provides HTTP endpoint for Cursor integration * Eliminates rebuild cycles during development * **Restarts only the MCP server**, not your environment state For stateful environments (browsers, databases, games), implement process separation so that `hud dev` hot-reload doesn't reset your environment state. See the browser environment for a complete example. ## SDK Reference Complete API documentation for MCPServer Available tools and executors ## Next Steps For the complete step-by-step guide with all implementation details, see the [environments README](https://github.com/hud-evals/hud-python/blob/main/environments/README.md). # TODO App Tutorial Source: https://docs.hud.so/build-environments/quickstart-todo Build your first MCP environment in 15 minutes Let's build a TODO app environment from scratch. This hands-on tutorial will teach you the core concepts of MCP environment development. ## What We're Building A simple TODO list that agents can interact with: * Add, complete, and delete tasks * List all tasks with their status * Evaluate based on task completion This tutorial assumes basic Python knowledge. 
Full code is available at `environments/todo-example` ## Step 1: Project Structure Create a new directory for your environment: ```bash mkdir todo-env cd todo-env ``` Create the following structure: ``` todo-env/ ├── Dockerfile ├── pyproject.toml ├── README.md └── src/ └── hud_controller/ ├── __init__.py ├── server.py └── todo_app.py ``` ## Step 2: The TODO Application First, let's build the core TODO logic: ```python src/hud_controller/todo_app.py from dataclasses import dataclass from typing import List, Optional from datetime import datetime @dataclass class TodoItem: id: int title: str completed: bool = False created_at: datetime = None def __post_init__(self): if self.created_at is None: self.created_at = datetime.now() class TodoApp: def __init__(self): self.todos: List[TodoItem] = [] self.next_id = 1 def add_todo(self, title: str) -> TodoItem: """Add a new TODO item""" todo = TodoItem(id=self.next_id, title=title) self.todos.append(todo) self.next_id += 1 return todo def complete_todo(self, todo_id: int) -> bool: """Mark a TODO as completed""" for todo in self.todos: if todo.id == todo_id: todo.completed = True return True return False def delete_todo(self, todo_id: int) -> bool: """Delete a TODO item""" for i, todo in enumerate(self.todos): if todo.id == todo_id: self.todos.pop(i) return True return False def list_todos(self) -> List[dict]: """Get all TODOs as dictionaries""" return [ { "id": todo.id, "title": todo.title, "completed": todo.completed, "created_at": todo.created_at.isoformat() } for todo in self.todos ] def clear_all(self): """Reset the TODO list""" self.todos = [] self.next_id = 1 ``` ## Step 3: MCP Server Integration Now wrap the TODO app with MCP: ```python src/hud_controller/server.py import sys import logging from hud.server import MCPServer from .todo_app import TodoApp # Configure logging to stderr logging.basicConfig( stream=sys.stderr, level=logging.INFO, format='[%(levelname)s] %(asctime)s | %(name)s | %(message)s' ) # Create server and app instances mcp = MCPServer(name="TODO Environment") app = TodoApp() # Interaction tools (visible to agents) @mcp.tool() async def add_todo(title: str) -> dict: """Add a new TODO item to the list""" todo = app.add_todo(title) return { "success": True, "todo": { "id": todo.id, "title": todo.title }, "message": f"Added TODO: {title}" } @mcp.tool() async def complete_todo(todo_id: int) -> dict: """Mark a TODO item as completed""" success = app.complete_todo(todo_id) return { "success": success, "message": f"{'Completed' if success else 'Failed to complete'} TODO {todo_id}" } @mcp.tool() async def list_todos() -> dict: """List all TODO items""" todos = app.list_todos() return { "todos": todos, "total": len(todos), "completed": sum(1 for t in todos if t["completed"]) } @mcp.tool() async def delete_todo(todo_id: int) -> dict: """Delete a TODO item""" success = app.delete_todo(todo_id) return { "success": success, "message": f"{'Deleted' if success else 'Failed to delete'} TODO {todo_id}" } # Setup tool (hidden from agents) @mcp.tool() async def setup(initial_todos: list[str] = None) -> dict: """Initialize TODO list with optional items""" app.clear_all() if initial_todos: for title in initial_todos: app.add_todo(title) return { "status": "success", "message": f"Initialized with {len(initial_todos or [])} TODOs" } # Evaluation tool (hidden from agents) @mcp.tool() async def evaluate(target_completed: int = None) -> dict: """Evaluate based on TODO completion""" todos = app.list_todos() total = len(todos) completed = sum(1 for t in 
todos if t["completed"]) if target_completed is not None: # Specific target reward = 1.0 if completed >= target_completed else completed / target_completed else: # General completion rate reward = completed / total if total > 0 else 0.0 return { "reward": reward, "done": reward >= 1.0, "info": { "completed": completed, "total": total, "details": f"{completed}/{total} TODOs completed" } } # Initialize handler @mcp.initialize async def on_initialize(ctx): """Called when MCP connection is established""" logging.info("TODO environment ready!") if __name__ == "__main__": mcp.run() ``` ## Step 4: Package Configuration Create a `pyproject.toml` file: ```toml pyproject.toml [project] name = "todo-env" version = "0.1.0" description = "TODO app MCP environment" requires-python = ">=3.11" dependencies = [ "hud-python>=0.5.0", ] [build-system] requires = ["setuptools>=61.0"] build-backend = "setuptools.build_meta" ``` ## Step 5: Docker Configuration Create the Dockerfile: ```dockerfile Dockerfile FROM python:3.11-slim # Prevent Python from buffering output ENV PYTHONUNBUFFERED=1 \ PYTHONDONTWRITEBYTECODE=1 WORKDIR /app # Copy package files COPY pyproject.toml ./ COPY src/ ./src/ # Install in editable mode for development RUN pip install --no-cache-dir -e . # Run as a module CMD ["python", "-m", "hud_controller.server"] ``` ## Step 6: Build and Test ### Option 1: Traditional Build & Test Build your Docker image: ```bash docker build -t todo-env . ``` Test with HUD CLI: ```bash # Run the 5-phase debug check hud debug todo-env # If that passes, analyze the tools hud analyze todo-env ``` ### Option 2: Hot-Reload Development (Recommended) Use the development proxy for instant feedback: ```bash # Start development server with hot-reload hud dev --build # In another terminal, test your environment hud analyze hud-todo-env:dev ``` Now you can edit `src/hud_controller/server.py` and see changes instantly without rebuilding! 
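If you want to poke at the container without an agent, you can drive it by hand with the SDK's MCP client. A sketch — it assumes the `todo-env` image built above and uses the `MCPClient` interface shown in the architecture docs (`initialize`, `call_tool`):

```python
import asyncio
from hud.clients import MCPClient

async def smoke_test():
    client = MCPClient({
        "todo": {
            "command": "docker",
            "args": ["run", "--rm", "-i", "todo-env:latest"],
        }
    })
    await client.initialize()

    # Walk the full lifecycle by hand: seed, interact, score
    await client.call_tool("setup", {"initial_todos": ["urgent: fix bug", "buy coffee"]})
    await client.call_tool("complete_todo", {"todo_id": 1})
    result = await client.call_tool("evaluate", {"target_completed": 1})
    print(result)  # expect reward 1.0 since one TODO is completed

asyncio.run(smoke_test())
```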
Your environment should pass all 5 debug phases and show the available tools ## Step 7: Create Tasks Now create tasks for agents to solve: ```python from hud.datasets import Task tasks = [ Task( prompt="Add three TODOs: buy groceries, call mom, finish report", mcp_config={ "local": { "command": "docker", "args": ["run", "--rm", "-i", "todo-env:latest"] } }, setup_tool={"name": "setup"}, evaluate_tool={ "name": "evaluate", "arguments": {"target_completed": 3} } ), Task( prompt="Complete all TODOs that contain the word 'urgent'", mcp_config={ "local": { "command": "docker", "args": ["run", "--rm", "-i", "todo-env:latest"] } }, setup_tool={ "name": "setup", "arguments": { "initial_todos": [ "urgent: fix bug", "buy coffee", "urgent: submit report" ] } }, evaluate_tool={ "name": "evaluate", "arguments": {"target_completed": 2} } ) ] ``` ## Step 8: Test with an Agent Run your first evaluation: ```python import asyncio import hud from hud.agents import ClaudeAgent async def test_todo_env(): task = tasks[0] # First task with hud.trace("todo-test"): agent = ClaudeAgent() result = await agent.run(task) print(f"Success: {result.reward == 1.0}") print(f"Details: {result.info}") asyncio.run(test_todo_env()) ``` ## Publishing Your Environment Once tested, publish to Docker Hub: ```bash # Tag your image docker tag todo-env your-dockerhub/todo-env:v1.0 # Push to registry docker push your-dockerhub/todo-env:v1.0 ``` ### Using Your Environment Remotely Others can now use your environment at scale: ```python import os from hud.datasets import Task from hud.agents import ClaudeAgent # Remote MCP configuration task = Task( prompt="Create 3 todos and complete the urgent ones", mcp_config={ "hud": { "url": "https://mcp.hud.so/v3/mcp", "headers": { "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}", "Mcp-Image": "your-dockerhub/todo-env:v1.0" } } }, setup_tool={"name": "setup"}, evaluate_tool={"name": "evaluate", "arguments": {"target_completed": 2}} ) # Run with any agent agent = ClaudeAgent() result = await agent.run(task) ``` ## Next Steps This tutorial covered the basics. For production environments, explore: Full 5-step process with advanced patterns Hub pattern, telemetry, and production tips The official [environments README](https://github.com/hud-evals/hud-python/blob/main/environments/README.md) contains comprehensive implementation details, troubleshooting tips, and reference examples. # Advanced Design Source: https://docs.hud.so/build-environments/registry Production patterns for MCP environments Learn advanced patterns for building production-ready MCP environments. These techniques are used in environments like `remote_browser` and enable sophisticated agent interactions. ## The Hub Pattern The hub pattern uses BaseHub, which is actually an inner MCP server that can be mounted to your main server. This provides namespace separation and modular organization. ### How Hubs Work A hub is an MCP server that exposes a single tool. When you call this tool, you specify which function within the hub to execute: ```json { "name": "setup", "arguments": { "name": "navigate_to_page", "arguments": { "url": "https://example.com" } } } ``` The outer `name` is the hub tool, inner `name` is the specific function, and inner `arguments` are passed to that function. ### Implementation ```python # setup/__init__.py from hud.tools.base import BaseHub setup = BaseHub("setup") # Import all setup modules from . import navigation, data, auth __all__ = ["setup"] ``` ```python # setup/navigation.py from . 
import setup @setup.tool() async def navigate_to_page(url: str): """Navigate browser to specified URL.""" env = setup.env # Access shared environment await env.browser.goto(url) return {"status": "success", "url": url} @setup.tool() async def reset_browser(): """Clear cookies and return to blank page.""" env = setup.env await env.browser.clear_cookies() await env.browser.goto("about:blank") ``` ```python # server.py from .setup import setup as setup_hub from .evaluate import evaluate as evaluate_hub mcp = MCPServer(name="my-environment") @mcp.initialize async def initialize_environment(ctx): # Create environment context env = await create_environment() # Set environment on hubs setup_hub.env = env evaluate_hub.env = env # Mount hubs to server mcp.mount(setup_hub) mcp.mount(evaluate_hub) ``` Benefits: * **Modular organization**: Each function in its own file * **Namespace separation**: Tools grouped logically * **Shared state**: Access environment via `hub.env` * **Auto-discovery**: Import registers all decorated functions ## Progress Notifications For long-running initialization, provide real-time feedback: ```python @mcp.initialize async def initialize_environment(ctx): """Initialize with progress updates.""" # Extract progress token from context progress_token = getattr(ctx.meta, "progressToken", None) if ctx.meta else None async def send_progress(progress: int, message: str): if progress_token: await ctx.session.send_progress_notification( progress_token=progress_token, progress=progress, total=100, message=message ) await send_progress(10, "Starting services...") await start_x11_server() await send_progress(40, "Launching browser...") browser = await launch_browser() await send_progress(70, "Connecting to cloud provider...") await connect_provider() await send_progress(100, "Environment ready!") ``` This shows live in the HUD platform during container startup. ## Dynamic Configuration via Agent Metadata Agents can pass metadata that gets injected into the MCP server's initialization context using `patch_mcp_config`: ```python # In your agent from hud.agents import ClaudeAgent agent = ClaudeAgent( metadata={ "display_width": 1280, "display_height": 720, "browser_provider": "anchorbrowser" } ) # This metadata is patched into mcp_config via patch_mcp_config # See hud/agents/base.py#L476-L479 ``` ### How It Works The agent's `_setup_config` method uses `patch_mcp_config` to inject metadata: ```python from hud.utils.mcp import MCPConfigPatch, patch_mcp_config async def _setup_config(self, mcp_config: dict[str, dict[str, Any]]) -> None: """Inject metadata into the metadata of the initialize request.""" if self.metadata: patch_mcp_config( mcp_config, MCPConfigPatch(meta=self.metadata), ) ``` ### Using Metadata in Your Environment ```python @mcp.initialize async def initialize_environment(ctx): # Access metadata from context metadata = ctx.meta # Adjust display resolution width = metadata.get("display_width", 1920) height = metadata.get("display_height", 1080) # Select browser provider provider = metadata.get("browser_provider", "chromium") # Configure performance headless = metadata.get("headless", False) # Apply configuration browser_config = { "viewport": {"width": width, "height": height}, "headless": headless, "args": ["--disable-gpu"] if headless else [] } browser = await launch_browser(browser_config) ``` See [patch\_mcp\_config implementation](https://github.com/hud-evals/hud-python/blob/main/hud/utils/mcp.py#L17-L35) for details. 
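In isolation, the patch is a plain config merge. A minimal sketch of calling it directly — the metadata values are illustrative, and exactly where the client places the merged metadata on the initialize request is an SDK implementation detail (see the linked source):

```python
from hud.utils.mcp import MCPConfigPatch, patch_mcp_config

mcp_config = {
    "hud": {
        "url": "https://mcp.hud.so/v3/mcp",
        "headers": {
            "Authorization": "Bearer sk-hud-...",
            "Mcp-Image": "my-env:latest",
        },
    }
}

# Merge agent metadata into the config; the client forwards it with the
# MCP initialize request, where the server reads it via ctx.meta
patch_mcp_config(
    mcp_config,
    MCPConfigPatch(meta={"display_width": 1280, "display_height": 720}),
)

print(mcp_config["hud"])  # now carries the metadata alongside url/headers
```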
## Live Telemetry Expose real-time data via resources: ```python @mcp.resource("telemetry://live") async def get_telemetry(): """Provide live environment status.""" return { "status": "running", "live_url": get_live_view_url(), # VNC or browser URL "metrics": { "memory_usage": get_memory_usage(), "active_connections": count_connections(), "uptime": get_uptime() }, "logs": get_recent_logs(lines=50) } ``` This appears in the HUD platform UI for debugging. ## VNC Server Integration For GUI environments, provide visual access: ```python # Start from base image with VNC FROM hudpython/novnc-base:latest # Your GUI app installation RUN apt-get update && apt-get install -y your-gui-app # VNC is already configured on ports 5901 (VNC) and 8080 (noVNC) ``` In your MCP server: ```python @mcp.resource("telemetry://live") async def get_telemetry(): # Provide noVNC URL for browser viewing return { "live_url": "http://localhost:8080/vnc.html", "vnc_port": 5901, "password": "vncpass" # Set via VNC_PW env var } ``` ## Using Base Images HUD provides optimized base images: ### hudpython/novnc-base * Ubuntu with XFCE desktop * VNC and noVNC pre-configured * Common utilities installed ```dockerfile FROM hudpython/novnc-base:latest # Add your GUI application RUN apt-get update && apt-get install -y \ libreoffice \ firefox \ your-app # VNC auto-starts with your app ``` ### hudpython/browser-base * Chromium and Playwright installed * Optimized for browser automation ```dockerfile FROM hudpython/browser-base:latest # Add your browser automation code COPY src/ /app/src/ CMD ["python", "-m", "src.server"] ``` ## Phase 6: Automated Improvement Enable Cursor Agent to improve your environment: ### 1. Development Configuration Use [reloaderoo](https://github.com/cameroncooke/reloaderoo) for hot-reload development: ```json // .cursor/mcp.json { "mcpServers": { "my-env-dev": { "command": "npx", "args": [ "reloaderoo", "--", // Enables hot-reload without restarting client "docker", "run", "-i", "--rm", "-v", "$(pwd)/src:/app/src:rw", "-e", "PYTHONDEBUG=1", "my-env:dev" ] } } } ``` ### 2. Cursor Rules Create `.cursor/rules/mcp_environment_iteration.mdc`: ```markdown --- description: Improve an MCP environment alwaysApply: false --- Setup 1. Ensure dev mode MCP config exists 2. Explore environment structure 3. Clarify improvement objectives Iteration 1. Test current functionality via tools 2. Note errors and improvements in TODOs 3. Fix issues in source code 4. Call restart_server to reload 5. Verify fixes work correctly 6. Continue until all TODOs complete ``` ### 3. Improvement Workflow Prompt Cursor: ``` Context: I have a browser environment in my-env folder Tools: Use my-env-dev to interact with it Task: Test all evaluators and fix any that fail ``` The agent will: 1. Explore available tools 2. Test each function 3. Fix issues in your code 4. Reload and verify 5. 
Continue iterating

## Example: Production Browser Environment

Here's how `remote_browser` implements these patterns:

```text
# Modular structure
src/hud_controller/
├── server.py      # Main MCP server with progress
├── providers/     # Cloud browser integrations
├── setup/         # Hub with navigation, cookies
├── evaluate/      # Hub with page checks
└── tools/         # Playwright and computer tools

# Advanced features
- Progress notifications during startup
- Live browser URL in telemetry
- Dynamic resolution from metadata
- Multiple provider support
- Automatic cleanup on shutdown
```

## Next Steps

* Complete guide with all implementation details
* Study production environments

# Architecture

Source: https://docs.hud.so/core-concepts/architecture

How HUD connects agents, environments, and evaluation

HUD is built as a thin orchestration layer on top of MCP, providing structure for agent evaluation, environment management, and telemetry.

## System Overview

```mermaid
graph TB
    subgraph "Remote"
        Dashboard["📊 Dashboard<br/>(app.hud.so)"]
        API["🔌 MCP Orchestrator<br/>(mcp.hud.so)"]
    end
    subgraph "Your Code"
        Agent["🤖 Agent<br/>(Claude/Operator)"]
        Task["📋 Task<br/>(Prompt + Evaluation)"]
        SDK["📦 hud-python"]
    end
    subgraph "Environments"
        LocalEnv["🖥️ Local Docker<br/>(Development)"]
        RemoteEnv["☁️ Remote Docker<br/>(100s Parallel)"]
    end
    subgraph "OpenTelemetry"
        Trace["📡 Traces & Metrics"]
    end
    Dataset["📚 Dataset<br/>(HuggingFace)"]
    AnyMCP["🔗 Any MCP Client<br/>(Cursor, Claude, Custom)"]

    Agent <--> SDK
    Task --> SDK
    Dataset <-.-> Task
    SDK <-->|"MCP"| LocalEnv
    SDK <-->|"MCP"| API
    API <-->|"MCP"| RemoteEnv
    SDK -->|"hud.trace"| Trace
    Trace --> Dashboard
    AnyMCP -->|"MCP"| API

    style Dashboard fill:#e0e7ff,stroke:#6366f1,stroke-width:2px
    style SDK fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
    style RemoteEnv fill:#d1fae5,stroke:#10b981,stroke-width:2px
    style AnyMCP fill:#fce7f3,stroke:#ec4899,stroke-width:2px,stroke-dasharray: 5 5
```

## Core Components

### 1. Agents (`hud.agents`)

Agents make decisions and call tools:

```python
from hud.agents import ClaudeAgent, OperatorAgent

# Built-in agents
agent = ClaudeAgent()    # Uses Anthropic's Claude
agent = OperatorAgent()  # Uses OpenAI Operator

# All agents inherit from MCPAgent
class MCPAgent:
    async def initialize(task: str | Task | None)
    async def run(prompt_or_task: str | Task, max_steps: int = 10) -> Trace
    async def call_tools(tool_calls: list[MCPToolCall]) -> list[MCPToolResult]
```

Agents can auto-create MCP clients from `task.mcp_config` - no manual client setup needed

### 2. Tasks (`hud.Task`)

Tasks define what agents should accomplish:

```python
from hud.datasets import Task

task = Task(
    prompt="Complete 3 items in the TODO list",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": f"Bearer {api_key}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },
    # Tool names and arguments map directly to MCP server tools
    setup_tool={
        "name": "setup",  # Calls setup(name, num_items) on the environment
        "arguments": {
            "name": "todo_seed",
            "num_items": 5
        }
    },
    evaluate_tool={
        "name": "evaluate",  # Calls evaluate(name, expected_count) on the environment
        "arguments": {
            "name": "todo_completed",
            "expected_count": 3
        }
    }
)
```

The `name` and `arguments` in setup/evaluate tools correspond exactly to the tool names and parameters exposed by the MCP server

### 3. MCP Clients (`hud.clients`)

Clients handle the MCP protocol:

```python
from hud.clients import MCPClient

# Auto-created by agents, or manual:
client = MCPClient(mcp_config)
await client.initialize()
tools = await client.list_tools()
```

### 4. Environments

Environments are MCP servers exposing tools:

```python
from hud.server import MCPServer

server = MCPServer("my-env")

@server.tool()
def move(direction: str):
    # Environment logic
    return result
```

### 5. Telemetry (`hud.trace`)

Real-time observability:

```python
import hud

with hud.trace("my-evaluation"):
    result = await agent.run(task)
    # View at app.hud.so/traces/{trace_id}
```

## Execution Flow

1. Create a `Task` with prompt and MCP configuration
2. Agent creates MCP client (if needed) and connects to environment
3. Execute `setup_tool` to initialize environment state
4. Agent receives observations, makes decisions, calls tools
5. Execute `evaluate_tool` to score performance
6. All interactions streamed to HUD backend for analysis

## Key Design Principles

1. **Protocol-First**: Everything speaks MCP
2. **Composable**: Mix and match agents, environments, evaluations
3. **Observable**: Built-in telemetry for every interaction
4. **Testable**: Reproducible evaluations with Docker
5. **Extensible**: Easy to add new agents or environments

The `MCPServer` class wraps FastMCP with lifecycle management, making it easy to build Docker-based environments
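Putting these pieces together, one evaluation run is just a task, an agent, and a trace. A condensed sketch reusing the snippets above — the 2048 image, prompt, and trace name are illustrative:

```python
import asyncio

import hud
from hud.agents import ClaudeAgent
from hud.datasets import Task

async def main():
    # Local stdio environment; the same task could point at mcp.hud.so instead
    task = Task(
        prompt="Play 2048 and reach the highest tile you can in ten moves",
        mcp_config={
            "hud-text-2048": {
                "command": "docker",
                "args": ["run", "--rm", "-i", "hudpython/hud-text-2048:v1.2"],
            }
        },
    )
    with hud.trace("2048-demo"):
        agent = ClaudeAgent()  # auto-creates an MCP client from task.mcp_config
        result = await agent.run(task, max_steps=10)
    print(result)  # a Trace; inspect the full run at app.hud.so

asyncio.run(main())
```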
## Next Steps

* Test and benchmark agents on standardized tasks
* Create MCP-compatible environments for your software

# MCP Protocol

Source: https://docs.hud.so/core-concepts/mcp-protocol

Understanding the Model Context Protocol that connects agents to environments

The Model Context Protocol (MCP) is the backbone of HUD: it gives AI agents a standard way to interact with any software environment through tool calls.

## Why MCP?

Traditional agent frameworks couple agents tightly to specific environments. MCP decouples them.

Without MCP:

* Agent code hardcoded for each environment
* No standardization across tools
* Difficult to swap agents or environments

With MCP:

* Any agent works with any environment
* Standard protocol for all interactions
* Easy to swap components

## How It Works

MCP standardizes agent-environment communication through JSON-RPC messages. Agents call tools exposed by environments and receive structured responses.

## Core Concepts

### Tools

[Tools](https://modelcontextprotocol.io/specification/server/tools) are functions exposed by the environment:

```json
{
  "name": "move",
  "description": "Move in a direction",
  "inputSchema": {
    "type": "object",
    "properties": {
      "direction": {
        "type": "string",
        "enum": ["up", "down", "left", "right"]
      }
    }
  }
}
```

### Tool Calls & Results

Agents [call tools](https://modelcontextprotocol.io/specification/server/tools#calling-tools) and receive results:

```python
# Agent makes a tool call
result = await client.call_tool("move", {"direction": "up"})

# Environment returns result
{
    "content": [{
        "type": "text",
        "text": "Moved up. New board state: ..."
    }]
}
```

### Lifecycle Management

MCP defines a [rigorous lifecycle](https://modelcontextprotocol.io/specification/basic/lifecycle) for connections:

1. **Initialization**: Client and server negotiate capabilities and protocol version with `client.initialize()`
2. **Operation**: Normal tool calling and message exchange
3. **Shutdown**: Clean termination of the connection

The protocol ensures both sides understand each other's capabilities before proceeding.

## HUD's MCP Extensions

HUD adds conventions on top of MCP:

1. **Setup Tools**: Initialize environment state (`setup_board`, `navigate_to_url`)
2. **Evaluate Tools**: Score agent performance (`evaluate_max_tile`, `contains_text`)
3. **Lifecycle Management**: Clean initialization and shutdown with `client.initialize()` and proper cleanup

See the [Tools Reference](/reference/tools) for implementation details.

## Transport Options

HUD environments are designed for 100% reproducibility through Docker:

Run environments locally for development and debugging using [stdio transport](https://modelcontextprotocol.io/specification/basic/transports#stdio):

```python
mcp_config = {
    "hud-text-2048": {
        "command": "docker",
        "args": ["run", "--rm", "-i", "hudpython/hud-text-2048:v1.2"]
    }
}
```

* **Transport**: [stdio](https://modelcontextprotocol.io/specification/basic/transports#stdio) - JSON-RPC over stdin/stdout
* **Pros**: Full control, easy debugging, no network latency
* **Use case**: Development, testing, single-agent runs

The `-i` flag enables Docker's interactive mode, allowing stdio communication between the client and server process.
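A quick way to verify the stdio wiring before handing an image to an agent is to list its tools through the SDK client. A sketch built from the `MCPClient` snippet on the Architecture page:

```python
import asyncio
from hud.clients import MCPClient

async def main():
    client = MCPClient({
        "hud-text-2048": {
            "command": "docker",
            "args": ["run", "--rm", "-i", "hudpython/hud-text-2048:v1.2"],
        }
    })
    await client.initialize()  # MCP handshake over the container's stdio
    tools = await client.list_tools()
    # Expect lifecycle tools (setup/evaluate) plus interaction tools
    print([tool.name for tool in tools])

asyncio.run(main())
```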
Scale to parallel runs with cloud infrastructure using [Streamable HTTP transport](https://modelcontextprotocol.io/specification/basic/transports#streamable-http):

```python
mcp_config = {
    "hud": {
        "url": "https://mcp.hud.so/v3/mcp",
        "headers": {
            "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
            "Mcp-Image": "hudpython/hud-text-2048:v1.2"
        }
    }
}
```

* **Transport**: [Streamable HTTP](https://modelcontextprotocol.io/specification/basic/transports#streamable-http) with SSE
* **Pros**: Infinite parallelization, no local setup, managed infrastructure
* **Use case**: Benchmarks, CI/CD, multi-agent experiments

HUD's orchestrator manages Docker containers in the cloud, exposing them via HTTP endpoints with Server-Sent Events for streaming.

Both approaches use the exact same Docker image, ensuring identical behavior whether running locally or remotely

## Next Steps

* See how HUD builds on MCP for agent evaluation

# Task System

Source: https://docs.hud.so/core-concepts/task-system

How tasks define agent objectives and evaluation criteria

Tasks define what agents should do and how to measure success. They combine a prompt, environment configuration, and optional setup/evaluation phases.

## Lifecycle

```mermaid
graph LR
    Task["📋 Task<br/>(Prompt + Config)"]
    Setup["🔧 Setup Phase<br/>(Initialize State)"]
    Execute["🤖 Execute Phase<br/>(Agent Works)"]
    Evaluate["📊 Evaluate Phase<br/>(Score Result)"]

    Task --> Setup
    Setup --> Execute
    Execute --> Evaluate

    style Task fill:#e0e7ff,stroke:#6366f1,stroke-width:2px
    style Execute fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
    style Evaluate fill:#d1fae5,stroke:#10b981,stroke-width:2px
```
## Task Structure

```python
from hud.datasets import Task
import uuid

task = Task(
    # Required fields
    prompt="Navigate to the login page and sign in as testuser@example.com",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },
    # Optional fields
    id=str(uuid.uuid4()),  # Required for HuggingFace datasets
    system_prompt="You are an expert web automation agent. Always verify page loads before interacting with elements.",
    setup_tool={
        "name": "playwright",
        "arguments": {
            "action": "navigate",
            "url": "https://example.com"
        }
    },
    evaluate_tool={
        "name": "evaluate",
        "arguments": {
            "name": "url_contains",
            "substring": "/dashboard"
        }
    },
    metadata={"category": "authentication", "difficulty": "easy"}
)
```

## Field Reference

* `prompt` (required): The instruction given to the agent. Be specific and clear about success criteria.
* `mcp_config` (required): Environment connection configuration. Supports environment variables with `${VAR}` syntax.
* `id` (optional): Unique identifier. Required when creating HuggingFace datasets. Use `str(uuid.uuid4())`.
* `system_prompt` (optional): Custom system prompt for the agent. Overrides the agent's default system prompt.
* `setup_tool` (optional): Tool(s) to initialize environment state, with `name` and `arguments`.
* `evaluate_tool` (optional): Tool(s) to score performance, with `name` and `arguments`.
* `metadata` (optional): Additional task information for analysis and filtering.

## Environment Variables

Tasks automatically resolve `${VAR}` patterns when deserialized from dictionaries. This is why HuggingFace datasets should store raw dictionaries, not Task objects.

```python
# In dataset JSON:
{
    "prompt": "Complete the TODO list",
    "mcp_config": {
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "${BROWSER_IMAGE}"
            }
        }
    }
}

# When loaded:
task = Task(**task_dict)  # Variables resolved here!
# Now task.mcp_config["hud"]["headers"]["Authorization"] = "Bearer sk-hud-..."
```

This enables:

* **Public datasets** without exposing secrets
* **Environment-specific** configurations
* **CI/CD pipelines** with different credentials

## Running Tasks

```python
# Agent automatically handles all phases
result = await agent.run(task)
print(f"Success: {result.reward}")  # 0.0 to 1.0
```

The agent will:

1. Execute `setup_tool` if provided
2. Work on the `prompt` using available tools
3. Execute `evaluate_tool` to calculate reward

## Working with Datasets

Tasks integrate with HuggingFace datasets:

```python
from datasets import load_dataset
from hud.datasets import run_dataset

# Load dataset from HuggingFace
dataset = load_dataset("hud-evals/sheetbench-50", split="train")

# Run agent on entire dataset
results = await run_dataset(
    "SheetBench Run",
    dataset,  # Pass the dataset object or name (e.g. "hud-evals/sheetbench-50")
    agent_class=ClaudeAgent,
    max_concurrent=5
)
```

## Creating Datasets

```python
from hud.datasets import save_tasks

# Create task dictionaries (NOT Task objects!)
task_dicts = [
    {
        "prompt": "Navigate to the login page",
        "mcp_config": {
            "hud": {
                "url": "${MCP_URL}",
                "headers": {"Authorization": "Bearer ${HUD_API_KEY}"}
            }
        },
        "setup_tool": {"name": "playwright", "arguments": {"action": "navigate", "url": "https://example.com"}},
        "evaluate_tool": {"name": "url_match", "arguments": {"pattern": ".*/login"}}
    },
    # More task dicts...
]

# Save to HuggingFace (preserves ${VAR} templates)
save_tasks(task_dicts, "my-org/my-benchmark")
```

Always save dictionaries, not Task objects. Task objects have already resolved environment variables!
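Consuming a published benchmark is then the mirror image: load the rows and let `Task(**row)` resolve the templates at run time. A sketch — the repo id and environment values are illustrative:

```python
import os

from datasets import load_dataset
from hud.datasets import Task

# Assumes the variables referenced by the ${VAR} templates are set, e.g.:
os.environ.setdefault("HUD_API_KEY", "sk-hud-...")
os.environ.setdefault("MCP_URL", "https://mcp.hud.so/v3/mcp")

rows = load_dataset("my-org/my-benchmark", split="train")
tasks = [Task(**row) for row in rows]  # ${VAR} templates resolve here

print(tasks[0].mcp_config["hud"]["headers"]["Authorization"])  # "Bearer sk-hud-..."
```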
## Best Practices

1. **Use UUIDs**: Always include `id=str(uuid.uuid4())` for dataset tasks
2. **Clear Prompts**: Be specific about success criteria
3. **Template Variables**: Use `${VAR}` syntax for shareable configs
4. **Rich Tools**: Include both `name` and `arguments` in tool definitions

## Next Steps

* Use tasks with standard benchmarks
* Create your own evaluation scenarios

# Create Agents

Source: https://docs.hud.so/evaluate-agents/create-agents

Build your own MCP-compatible agent for HUD evaluation

Create custom agents that work with HUD's evaluation system. You can either extend the `MCPAgent` class for full control, or use the quickstart template for a simpler approach.

## Quick Start with Template

The fastest way to create an agent is using the HUD quickstart template:

```bash
# Clone the quickstart repository
uvx hud-python quickstart

# This creates a full agent implementation with:
# - Complete lifecycle example
# - Direct tool call handling
# - Ready-to-run code
```

The quickstart provides a working agent that you can modify to add your own logic for making tool calls.

## Building from Scratch

For full control, create an agent by inheriting from `MCPAgent` and implementing four core methods:

```python
from hud.agents import MCPAgent
from hud.types import AgentResponse, MCPToolCall, MCPToolResult

class MyAgent(MCPAgent):
    """Your custom agent implementation."""

    async def get_system_messages(self) -> list[Any]:
        """Return system messages for your LLM."""
        pass

    async def get_response(self, messages: list[Any]) -> AgentResponse:
        """Generate agent response with tool calls."""
        pass

    async def format_blocks(self, blocks: list[ContentBlock]) -> list[Any]:
        """Format content blocks for your LLM."""
        pass

    async def format_tool_results(
        self,
        tool_calls: list[MCPToolCall],
        tool_results: list[MCPToolResult]
    ) -> list[Any]:
        """Format tool results back into messages."""
        pass
```

## Understanding the Interface

The methods provide a bridge between two worlds:

* **`messages: list[Any]`** - Your LLM's native message format (e.g., OpenAI's format)
* **`blocks: list[ContentBlock]`** - MCP's content blocks (text, images, etc.)
* **`tool_calls: list[MCPToolCall]`** and **`tool_results: list[MCPToolResult]`** - Native MCP types

These methods translate between your LLM's format and MCP's standardized format.

## Implementation Guide

### 1. Get System Messages

Define the system prompt for your agent:

```python
async def get_system_messages(self) -> list[Any]:
    """
    Return system messages in your LLM's format.
    """
    # Use the configured system prompt or default
    prompt = self.system_prompt or "You are a helpful assistant."

    # Return in your LLM's expected format
    return [{"role": "system", "content": prompt}]
```

### 2. Get Response Method

This is where your agent decides what to do:

```python
async def get_response(self, messages: list[Any]) -> AgentResponse:
    """
    Call your LLM and return tool calls.
Args: messages: Conversation history in your LLM's format Returns: AgentResponse with content and tool_calls """ # Call your LLM API response = await self.llm_client.chat( messages=messages, tools=self._available_tools # Provided by MCPAgent ) # Parse response into tool calls tool_calls = [] if response.tool_calls: for tc in response.tool_calls: tool_calls.append( MCPToolCall( name=tc.function.name, arguments=json.loads(tc.function.arguments), id=tc.id # Optional tool call ID ) ) return AgentResponse( content=response.content, tool_calls=tool_calls ) ``` ### 3. Format Blocks Method Convert MCP's content blocks into your LLM's message format: ```python async def format_blocks(self, blocks: list[ContentBlock]) -> list[Any]: """ Format content blocks from tools into messages. Args: blocks: List of MCP ContentBlock objects Returns: Messages in your LLM's expected format """ # Example for OpenAI-style format content = [] for block in blocks: if block.type == "text": content.append({"type": "text", "text": block.text}) elif block.type == "image": content.append({ "type": "image_url", "image_url": {"url": f"data:{block.mime_type};base64,{block.data}"} }) return [{"role": "user", "content": content}] ``` ### 4. Format Tool Results Method Convert tool execution results back into messages: ```python async def format_tool_results( self, tool_calls: list[MCPToolCall], tool_results: list[MCPToolResult] ) -> list[Any]: """ Format tool results for the next LLM call. Args: tool_calls: The MCP tool calls that were made tool_results: MCP results from executing those tools Returns: Messages to append to conversation """ messages = [] for call, result in zip(tool_calls, tool_results): # Add the assistant's tool call messages.append({ "role": "assistant", "content": None, "tool_calls": [{ "id": call.id or str(uuid.uuid4()), "type": "function", "function": { "name": call.name, "arguments": json.dumps(call.arguments) } }] }) # Add the tool result messages.append({ "role": "tool", "tool_call_id": call.id, "content": json.dumps(result.content) }) return messages ``` ## Custom Agent Implementation To create a custom agent, implement the `MCPAgent` protocol by defining these four core methods: ```python from hud.agents import MCPAgent from hud.types import AgentResponse, MCPToolCall, MCPToolResult from typing import Any import json class CustomAgent(MCPAgent): """Custom agent implementation for any LLM provider.""" def __init__(self, model: str = "your-model", **kwargs): super().__init__(**kwargs) self.model = model # Initialize your LLM client here # self.client = YourLLMClient() async def get_system_messages(self) -> list[Any]: """Return system messages for your LLM. This method should return the system prompt and any other initialization messages in your LLM's expected format. """ prompt = self.system_prompt or "You are a helpful assistant that uses tools to complete tasks." return [{"role": "system", "content": prompt}] async def get_response(self, messages: list[Any]) -> AgentResponse: """Generate agent response with tool calls. This is where you call your LLM and parse the response into tool calls. Convert available MCP tools to your LLM's format. 
""" # Convert MCP tools to your LLM's tool format tools = [ { "name": tool.name, "description": tool.description, "parameters": tool.inputSchema } for tool in self._available_tools ] # Call your LLM with tools response = await self.call_llm(messages, tools) # Parse tool calls from response tool_calls = self.parse_tool_calls(response) return AgentResponse( content=response.get("content", ""), tool_calls=tool_calls ) async def format_blocks(self, blocks: list[Any]) -> list[Any]: """Format content blocks for your LLM. Convert MCP content blocks (text, images) into your LLM's message format. """ messages = [] for block in blocks: if block.type == "text": messages.append({"role": "user", "content": block.text}) elif block.type == "image": # Handle image blocks if your LLM supports them messages.append({ "role": "user", "content": [ {"type": "image", "image": block.data} ] }) return messages async def format_tool_results( self, tool_calls: list[MCPToolCall], tool_results: list[MCPToolResult] ) -> list[Any]: """Format tool results for your LLM. Convert tool call results back into your LLM's message format. """ messages = [] # Add assistant message with tool calls messages.append({ "role": "assistant", "tool_calls": [ { "id": call.id, "name": call.name, "arguments": call.arguments } for call in tool_calls ] }) # Add tool results for call, result in zip(tool_calls, tool_results): content = json.dumps(result.content) if result.content else "" messages.append({ "role": "tool", "tool_call_id": call.id, "content": content }) return messages # Helper methods you'll need to implement async def call_llm(self, messages: list[Any], tools: list[Any]) -> dict: """Call your LLM API and return the response.""" # Implement your LLM API call here pass def parse_tool_calls(self, response: dict) -> list[MCPToolCall]: """Parse tool calls from your LLM's response format.""" # Implement parsing logic for your LLM's tool call format pass ``` ## Testing Your Agent Test your agent on a simple task: ```python import asyncio import hud import os from hud.datasets import Task async def test_agent(): with hud.trace("test-custom-agent"): task = Task( prompt="Navigate to example.com", mcp_config={ "hud": { "url": "https://mcp.hud.so/v3/mcp", "headers": { "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}", "Mcp-Image": "hudpython/hud-remote-browser:latest" } } }, setup_tool={ "name": "setup", "arguments": { "name": "navigate", "arguments": {"url": "https://example.com"} } }, evaluate_tool={ "name": "evaluate", "arguments": { "name": "url_match", "arguments": {"pattern": "example.com"} } } ) # Use your custom agent agent = CustomAgent(model="your-model") result = await agent.run(task) print(f"Reward: {result.reward}") asyncio.run(test_agent()) ``` ## Built-in Agents For quick prototyping, HUD provides built-in agents: ```python from hud.agents import ClaudeAgent, OperatorAgent # Claude (Anthropic) claude_agent = ClaudeAgent( allowed_tools=["anthropic_computer"] ) # OpenAI openai_agent = OperatorAgent( allowed_tools=["openai_computer"] ) ``` These are great for getting started, but implementing your own agent gives you full control over the conversation flow, tool selection, and response formatting. 
## Advanced Features ### Custom System Prompts ```python class MyAgent(MCPAgent): def __init__(self, **kwargs): super().__init__( system_prompt="You are an expert web automation agent.", **kwargs ) ``` ### Tool Filtering ```python # Only allow specific tools agent = MyAgent( allowed_tools=["click", "type", "playwright"], disallowed_tools=["execute_script"] ) ``` ### Response Agent The ResponseAgent is an OpenAI-powered helper that determines whether an agent should stop or continue based on the agent's messages. It's useful for handling ambiguous situations: ```python from hud.agents.misc import ResponseAgent # ResponseAgent analyzes agent messages like: # "I've completed the form. Should I submit it?" # And returns "STOP" or "CONTINUE" agent = MyAgent( response_agent=ResponseAgent() # Requires OPENAI_API_KEY ) ``` ## Next Steps Track and compare agent performance Build custom evaluation datasets # Create Benchmarks Source: https://docs.hud.so/evaluate-agents/create-benchmarks Build evaluation datasets using existing environments or custom setups Create benchmarks to evaluate agent capabilities systematically. HUD makes it easy to build evaluation datasets using existing environments or by creating entirely custom setups. ## Using Existing Environments The fastest way to create benchmarks is to leverage existing HUD environments and their evaluators. ### 1. Explore Available Evaluators Use `hud analyze` to discover what's available: ```bash $ hud analyze hudpython/hud-remote-browser:latest 📋 Evaluators: ├── url_match(pattern: str) - Check if URL matches pattern ├── page_contains(text: str, regex: bool = False) - Check page content ├── element_exists(selector: str) - Check if CSS selector exists ├── cookie_exists(name: str) - Check if cookie exists ├── sheet_contains(cell: str, value: str) - Check spreadsheet cell ├── sheets_cell_values(args: dict) - Verify multiple cell values └── ... more evaluators 📝 Setup Functions: ├── navigate(url: str) - Navigate to URL ├── cookies(action: str, ...) - Manage cookies ├── sheets(action: str, ...) - Spreadsheet operations └── ... more setup functions ``` ### 2. Create Tasks with Known Evaluators Build tasks using the discovered evaluators: ```python import uuid from hud.datasets import Task # Example: Web navigation benchmark web_tasks = [] # Task 1: Simple navigation web_tasks.append({ "id": str(uuid.uuid4()), "prompt": "Navigate to the documentation page", "mcp_config": { "hud": { "url": "https://mcp.hud.so/v3/mcp", "headers": { "Authorization": "Bearer ${HUD_API_KEY}", "Mcp-Image": "hudpython/hud-remote-browser:latest" } } }, "setup_tool": { "name": "setup", "arguments": { "name": "navigate", "arguments": {"url": "https://example.com"} } }, "evaluate_tool": { "name": "evaluate", "arguments": { "name": "url_match", "arguments": { "pattern": ".*/docs.*" } } }, "metadata": {"difficulty": "easy", "category": "navigation"} }) # Task 2: Form interaction web_tasks.append({ "id": str(uuid.uuid4()), "prompt": "Submit the contact form with the name 'Test User' and email 'test@example.com'", "setup_tool": { "name": "setup", "arguments": { "name": "navigate", "arguments": {"url": "https://example.com/contact"} } }, "evaluate_tool": { "name": "evaluate", "arguments": { "name": "page_contains", "arguments": { "text": "Thank you for your submission" } } }, "metadata": {"difficulty": "medium", "category": "forms"} }) ``` ### 3. 
Test Your Benchmark

Test your benchmark with `run_dataset`:

```python
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent

# Test your tasks
results = await run_dataset(
    "Web Navigation Test",
    web_tasks,  # Your list of task dicts
    agent_class=ClaudeAgent,
    max_concurrent=2
)

# Check results
success_rate = sum(r.reward > 0.5 for r in results) / len(results)
print(f"Success rate: {success_rate:.2%}")
```

## Publishing Benchmarks

All benchmarks are stored as HuggingFace datasets.

### Save Your Dataset

```python
from hud.datasets import save_tasks

# Save to HuggingFace (requires HF token)
save_tasks(
    web_tasks,  # List of task dictionaries
    repo_id="my-org/web-navigation-benchmark",
    private=False,  # Make it public
    tags=["web", "navigation", "automation"]
)
```

### Manual Upload

You can also create datasets manually:

1. Save tasks as JSON:

```python
import json

with open("benchmark.json", "w") as f:
    json.dump(web_tasks, f, indent=2)
```

2. Upload to HuggingFace:

* Go to [https://huggingface.co/new-dataset](https://huggingface.co/new-dataset)
* Upload your JSON file
* Add a README with a benchmark description

### Leaderboard Creation

Leaderboards are automatically created when the first scorecard is published for your dataset. Simply run an evaluation and create a scorecard at `app.hud.so/leaderboards/{your-dataset-id}`.

## Extending Existing Environments

### Explore Available Evaluators

For deeper customization, explore existing evaluators and contribute new ones:

```python
# Example: Contributing a new evaluator
# In environments/remote_browser/src/hud_controller/evaluate/form_complete.py
from hud.tools.types import EvaluationResult
from . import evaluate

@evaluate.tool("form_complete")
async def form_complete(ctx, required_fields: list[str]):
    """Check if all required form fields are filled."""
    # Access environment from the hub
    env = evaluate.env  # BrowserEnvironmentContext or similar

    # Use environment methods
    page_content = await env.get_page_content()

    # Check each field (simplified example)
    filled_count = 0
    for field in required_fields:
        if f'value="{field}"' in page_content or f'>{field}<' in page_content:
            filled_count += 1

    success = filled_count == len(required_fields)
    return EvaluationResult(
        reward=1.0 if success else filled_count / len(required_fields),
        done=success,
        info={"filled": filled_count, "required": len(required_fields)}
    )
```

Submit evaluators via pull requests to expand environment capabilities.

## Creating Custom Environments

For complete control over evaluation logic, build your own environment.

Complete guide to creating Docker-based MCP environments with custom evaluators

## Example: SheetBench-50

Here's an actual task from SheetBench-50:

```python
# Example task from the dataset
sheetbench_task = {
    "id": "6e4744c7-b2c9-4bb6-807e-2cc144a4e8c2",
    "prompt": "Calculate from the RawData tab the z-scores from the mean close price for each row. Return, starting in ANSWER!A1 and descending to ANSWER!A5, the 5 dates with the greatest absolute value of standard deviations from the mean",
    "mcp_config": {
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Run-Id": "${RUN_ID}",
                "Mcp-Image": "hudpython/hud-remote-browser:v1.1"
            }
        }
    },
    "setup_tool": {
        "name": "setup",
        "arguments": {
            "name": "sheets_from_xlsx",
            "arguments": {
                "file_url": "https://gahludmjcsmszgyufydt.supabase.co//storage/v1/object/public/sheetbench/c6ddeb9a-0c16-4f5e-8a06-f148ebb4be8a/setup_input_2.xlsx?"
} } }, "evaluate_tool": { "name": "evaluate", "arguments": { "name": "sheets_cell_values", "arguments": { "args": { "A1": "1/12/2024", "A2": "1/10/2024", "A3": "1/15/2024", "A4": "1/11/2024", "A5": "1/17/2024" } } } }, "metadata": { "partial": True, "gold_file_url": "https://gahludmjcsmszgyufydt.supabase.co//storage/v1/object/public/sheetbench/c6ddeb9a-0c16-4f5e-8a06-f148ebb4be8a/gold_solution_2.xlsx?" } } # The dataset includes 50 such tasks covering: # - Formula creation (VLOOKUP, SUMIF, etc.) # - Data analysis (z-scores, correlations) # - Data manipulation (sorting, filtering) # - Chart creation # - Pivot tables ``` ## Best Practices 1. **Clear Instructions**: Write unambiguous prompts 2. **Isolated State**: Each task should be independent 3. **Measurable Success**: Use binary or graduated scoring 4. **Diverse Coverage**: Test different aspects of capability 5. **Difficulty Progression**: Include easy to hard tasks 6. **Metadata Tags**: Help users filter relevant tasks ## Next Steps Track and compare agent performance on your benchmarks Create custom MCP environments for unique evaluation scenarios Use your datasets to train better agents with RL # Evaluate Agents Source: https://docs.hud.so/evaluate-agents/index Test and benchmark your agents on standardized tasks HUD makes it easy to evaluate any MCP-compatible agent on a variety of tasks and benchmarks. Whether you're testing Claude, Operator, or your custom agent, the evaluation flow is consistent. ## Quick Start Evaluate an agent on a single task: ```python import hud import os from hud.datasets import Task from hud.agents import ClaudeAgent with hud.trace("eval-demo"): task = Task( prompt="Create a spreadsheet with quarterly sales data", mcp_config={ "hud": { "url": "https://mcp.hud.so/v3/mcp", "headers": { "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}", "Mcp-Image": "hudpython/hud-browser:latest" } } }, setup_tool={ "name": "playwright", "arguments": { "action": "navigate", "url": "https://sheets.google.com" } }, evaluate_tool={ "name": "evaluate", "arguments": { "name": "sheet_contains", "text": "Q1 Sales" } } ) agent = ClaudeAgent() result = await agent.run(task) print(f"Success: {result.reward > 0.5}") ``` ## What You Can Do ### 1. Create Tasks for Browser Automation ```python import os task = Task( prompt="Navigate to GitHub and star the HUD repository", mcp_config={ "hud": { "url": "https://mcp.hud.so/v3/mcp", "headers": { "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}", "Mcp-Image": "hudpython/hud-browser:latest" } } }, setup_tool={ "name": "setup", "arguments": { "name": "navigate", "url": "https://github.com" } }, evaluate_tool={ "name": "evaluate", "arguments": { "name": "url_contains", "substring": "hud-python" } } ) ``` ### 2. Run Existing Benchmarks ```python from datasets import load_dataset from hud.datasets import run_dataset # Load and run SheetBench-50 dataset = load_dataset("hud-evals/sheetbench-50", split="train") results = await run_dataset( "My SheetBench Run", dataset, agent_class=ClaudeAgent, max_concurrent=5 ) ``` ### 3. Make Your Agent Work with HUD To create a custom agent, inherit from `MCPAgent` and implement the required methods: ```python from hud.agents import MCPAgent from hud.types import AgentResponse, MCPToolCall class MyCustomAgent(MCPAgent): async def get_response(self, messages: list[Any]) -> AgentResponse: # Call your LLM and return tool calls ... async def format_blocks(self, blocks: list[Any]) -> list[Any]: # Format content blocks into messages for your LLM ... 
async def format_tool_results( self, tool_calls: list[MCPToolCall], tool_results: list[Any] ) -> list[Any]: # Format tool results back into messages ... # Now it works with any HUD dataset! ``` See complete examples and implementation details ## Available Benchmarks 50 real-world spreadsheet tasks testing data manipulation, formulas, and analysis Desktop automation tasks across Ubuntu applications Web navigation and interaction challenges (Coming Soon) Strategic planning in the 2048 game environment ## Exploring Environments Use `hud analyze` to discover available tools and evaluators: ```bash $ hud analyze hudpython/hud-browser:latest 🔍 Analyzing hudpython/hud-browser:latest... 📊 Environment Summary: ├── Tools: 15 available ├── Setup Functions: 8 available ├── Evaluators: 12 available └── Resources: 3 available 🛠️ Tools: ├── playwright(action: str, **kwargs) - Browser automation actions ├── click(selector: str) - Click element ├── type(selector: str, text: str) - Type text └── ... 12 more tools 📋 Evaluators: ├── url_contains(substring: str) - Check if URL contains text ├── page_contains(text: str, regex: bool = False) - Check page content ├── element_exists(selector: str) - Check if CSS selector exists ├── todo_completed(expected_count: int) - Verify TODO completion └── ... 8 more evaluators Run with --json for full details or pipe to grep for filtering. ``` ## Publishing to Leaderboards After running evaluations, view results on the leaderboard: ```python # Run evaluation results = await run_dataset( "Claude-3.5 SheetBench", dataset="hud-evals/sheetbench-50", agent_class=ClaudeAgent ) # Then visit: app.hud.so/leaderboards/hud-evals/sheetbench-50 # Click "My Jobs" to see your runs and create scorecards ``` ## Key Features * **Reproducible**: Docker environments ensure consistency * **Parallel**: Run multiple evaluations concurrently * **Observable**: Every tool call tracked with telemetry * **Extensible**: Easy to add new tasks or benchmarks ## Next Steps Build your own MCP-compatible agent Track and compare agent performance Build custom evaluation datasets # Leaderboards Source: https://docs.hud.so/evaluate-agents/leaderboards Track and compare agent performance on standardized benchmarks HUD automatically tracks all benchmark runs and provides leaderboards for comparing agent performance across teams and models. ## Running Benchmarks Execute evaluations on standard datasets: ```python from hud.datasets import run_dataset from hud.agents import ClaudeAgent # Run a benchmark results = await run_dataset( name="Claude-3.5 SheetBench", dataset="hud-evals/sheetbench-50", agent_class=ClaudeAgent, agent_config={"model": "claude-3-5-sonnet-20241022"}, max_concurrent=5 ) ``` ## Viewing Results ### On the Leaderboard Page After running, view your results: 1. Navigate to `app.hud.so/leaderboards/{dataset-name}` * Example: `app.hud.so/leaderboards/hud-evals/sheetbench-50` 2. Click "My Jobs" to see your evaluation runs 3. 
Select runs to analyze performance ### Local Analysis Calculate metrics locally: ```python # Basic metrics success_rate = sum(r.reward > 0.5 for r in results) / len(results) avg_reward = sum(r.reward for r in results) / len(results) avg_duration = sum(r.duration for r in results) / len(results) print(f"Success Rate: {success_rate:.2%}") print(f"Average Reward: {avg_reward:.3f}") print(f"Average Duration: {avg_duration:.1f}s") ``` ## Creating Scorecards Transform your evaluation run into a public scorecard: Creating a scorecard from evaluation results Execute your agent on the full dataset Go to `app.hud.so/leaderboards/{dataset-name}` Click "My Jobs" and choose your evaluation run Select your best runs (we compute statistics over multiple jobs) and create a scorecard Your scorecard appears on the public leaderboard Example leaderboard showing different models and their scores ## Available Benchmarks Spreadsheet manipulation and analysis tasks Desktop automation in Ubuntu environment Strategic planning and game playing Web navigation challenges (Coming Soon) ## Best Practices 1. **Agent Naming**: Ensure your agent class has a meaningful name - it will be displayed on the leaderboard 2. **Multiple Runs**: Select your best few runs when creating a scorecard (we compute statistics over them) 3. **Consistent Configuration**: Use the same model version and parameters for fair comparison 4. **Share Results**: Make your scorecards public to contribute to the community Leaderboards are automatically created when the first scorecard is published for a dataset ## Next Steps Build your own MCP-compatible agent Build custom evaluation datasets # Introduction Source: https://docs.hud.so/index OSS RL environment + evals toolkit. Test Claude, Operator, or custom agents on benchmarks like SheetBench and OSWorld Wrap any software in dockerized MCP for scalable and generalizable agent evaluation Use reinforcement learning and GRPO on evaluations to improve agent performance ## What is HUD? HUD connects AI agents to software environments using the Model Context Protocol (MCP). Whether you're evaluating existing agents, building new environments, or training models with RL, HUD provides the infrastructure. ```mermaid graph LR Agent["🤖 Any Agent
(Claude, Operator, etc.)"] MCP["🔌 MCP Protocol
(Tool Calls)"] Env["📦 Any Environment
(Browser, OS, etc.)"] Agent -->|"call_tool()"| MCP MCP -->|"click(x, y)"| Env Env -->|"screenshot"| MCP MCP -->|"get_response()"| Agent style Agent fill:#e0e7ff,stroke:#6366f1,stroke-width:2px style MCP fill:#fef3c7,stroke:#f59e0b,stroke-width:2px style Env fill:#d1fae5,stroke:#10b981,stroke-width:2px ``` ## Why HUD? * **🔌 MCP-native**: Any agent can connect to any environment * **📡 Live telemetry**: Debug every tool call at [app.hud.so](https://app.hud.so) * **🚀 Production-ready**: From local Docker to cloud scale * **🎯 Built-in benchmarks**: OSWorld-Verified, SheetBench-50, and more * **🔧 CLI tools**: Debug, analyze, create environments with `hud debug` and `hud analyze` Run your first agent evaluation with zero setup ```bash uvx hud-python quickstart ``` ## Quick Example ```python import asyncio, os, hud from hud.datasets import Task from hud.agents import ClaudeAgent async def main(): # Define evaluation task with remote MCP task = Task( prompt="Win a game of 2048 by reaching the 128 tile", mcp_config={ "hud": { "url": "https://mcp.hud.so/v3/mcp", "headers": { "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}", "Mcp-Image": "hudpython/hud-text-2048:v1.2" } } }, setup_tool={"name": "setup", "arguments": {"name": "board", "arguments": { "board_size": 4}}}, evaluate_tool={"name": "evaluate", "arguments": {"name": "max_number", "arguments": {"target": 64}}} ) # Run agent (auto-creates MCP client) agent = ClaudeAgent() result = await agent.run(task) print(f"Score: {result.reward}") asyncio.run(main()) ``` ## Community Star the repo and contribute Join our community ### Are you a startup building agents? [📅 Hop on a call](https://cal.com/jay-ram-z6st6w/demo) or [📧 founders@hud.so](mailto:founders@hud.so) # LLM Quickstart Source: https://docs.hud.so/llm-quickstart Add context about hud to any coding agent ## Setup ```bash claude mcp add --transport http docs-hud https://docs.hud.so/mcp ``` Add to MCP settings: ```json "docs-hud": { "url": "https://docs.hud.so/mcp" } ``` Or use one-click install: [![Install MCP Server](https://cursor.com/deeplink/mcp-install-dark.svg)](https://cursor.com/en/install-mcp?name=docs-hud-so\&config=eyJ1cmwiOiJodHRwczovL2RvY3MuaHVkLnNvL21jcCJ9) Copy paste the full llms-full.txt ## What you get Your AI assistant gains access to: * Complete HUD API reference and methods * Working code examples and implementations * Architecture patterns and best practices * Real-time updates as documentation evolves Try asking your assistant: "How do I create a custom agent in HUD?" or "Help me debug MCP tool calls" ## Next steps Get started with HUD in 3 minutes Create custom MCP environments # Quickstart Source: https://docs.hud.so/quickstart Run your first agent evaluation in 3 minutes Get up and running with HUD in minutes. This guide walks you through installation, basic setup, and running your first agent evaluation. ## Quick Clone The fastest way to get started: ```bash # Clone a complete example project with uv uvx hud-python quickstart ``` This sets you up with a working agent evaluation example you can run immediately. ## Installation ```bash pip install hud-python ``` ```bash # Includes AI providers and telemetry pip install "hud-python[agent]" ``` ```bash # Install CLI in isolated environment uv tool install hud-python ``` ## API Keys Set your API keys as environment variables: ```bash export HUD_API_KEY="sk-hud-..." # Get from app.hud.so export ANTHROPIC_API_KEY="sk-ant-..." # For Claude agents export OPENAI_API_KEY="sk-..." 
# For OpenAI agents
```

Create a `.env` file in your project root to manage API keys locally.

## Your First Agent

Run an agent on the 2048 game environment:

```python
import asyncio, os
import hud
from hud.datasets import Task
from hud.agents import ClaudeAgent

async def main():
    # The trace context captures ALL agent interactions within a "task run"
    # Everything inside this context shows up as one trace on app.hud.so
    with hud.trace("quickstart-2048"):
        # Define task with remote MCP environment
        task = Task(
            prompt="Win a game of 2048 by reaching the 128 tile",
            mcp_config={
                "hud": {
                    "url": "https://mcp.hud.so/v3/mcp",
                    "headers": {
                        "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                        "Mcp-Image": "hudpython/hud-text-2048:v1.2"
                    }
                }
            },
            setup_tool={"name": "setup", "arguments": {"name": "board", "arguments": {"board_size": 4}}},
            evaluate_tool={"name": "evaluate", "arguments": {"name": "max_number", "arguments": {"target": 64}}}
        )

        # Run agent (auto-creates MCP client from task.mcp_config)
        agent = ClaudeAgent()
        result = await agent.run(task)
        print(f"Max tile reached: {result.reward}")

asyncio.run(main())
```

The trace context ensures that the entire task run - from setup through evaluation - appears as one coherent trace on the platform.

## What just happened?

1. **Task Definition**: We created a `Task` with:
   * A prompt telling the agent what to do
   * An `mcp_config` pointing to a remote MCP environment
   * Setup and evaluation tools to initialize and score the game
2. **Auto Client**: The agent automatically created an MCP client from `task.mcp_config`
3. **Telemetry**: The `trace` context captured all interactions for debugging
4. **Evaluation**: The `evaluate` tool ran the `max_number` evaluator, which scored the run by the highest tile reached

## Next Steps

Learn how agents connect to environments

Test on SheetBench and OSWorld

Create your own MCP environment

## CLI Quick Reference

```bash
# Analyze any MCP environment
hud analyze hudpython/hud-text-2048:v1.2

# Debug environment initialization
hud debug hudpython/hud-text-2048:v1.2

# Start hot-reload development server
hud dev . --build

# Run HUD tools as MCP server (for Cursor/Claude)
hud mcp
```

See the [CLI Reference](/reference/cli/overview) for detailed command documentation.

# Agents

Source: https://docs.hud.so/reference/agents

SDK reference for HUD agent classes

The HUD SDK provides a base `MCPAgent` class and several pre-built agent implementations for interacting with MCP environments.

## Base Class

### MCPAgent

```python
from hud.agents import MCPAgent
```

Abstract base class for all MCP-enabled agents. Handles tool discovery, filtering, and execution flow.
**Constructor Parameters:** | Parameter | Type | Description | Default | | --------------------------- | ---------------- | ------------------------------------------ | -------------- | | `mcp_client` | `AgentMCPClient` | MCP client for server connections | `None` | | `allowed_tools` | `list[str]` | List of tool names to allow | `None` (all) | | `disallowed_tools` | `list[str]` | List of tool names to disallow | `[]` | | `lifecycle_tools` | `list[str]` | Tools hidden from agent (setup/evaluate) | `[]` | | `initial_screenshot` | `bool` | Capture screenshot before first prompt | `False` | | `system_prompt` | `str` | System prompt to use | Default prompt | | `append_tool_system_prompt` | `bool` | Append available tools to system | `False` | | `append_setup_output` | `bool` | Append setup tool output to initial prompt | `False` | | `model_name` | `str` | Model name for telemetry | `"mcp-agent"` | | `response_agent` | `ResponseAgent` | Optional auto-response handler | `None` | | `auto_trace` | `bool` | Enable automatic telemetry | `True` | **Key Methods:** ```python async def initialize(task: str | Task | None = None) -> None """Initialize agent with task-specific configuration.""" async def run(prompt_or_task: str | Task | dict[str, Any], max_steps: int = 10) -> Trace """Run agent with prompt or task. Returns Trace with results.""" async def call_tools(tool_call: MCPToolCall | list[MCPToolCall]) -> list[MCPToolResult] """Execute tool calls through MCP client.""" def get_available_tools() -> list[types.Tool] """Get filtered list of available tools (excludes lifecycle).""" def get_tool_schemas() -> list[dict] """Get tool schemas formatted for the model.""" ``` **Abstract Methods (must implement):** ```python async def get_system_messages() -> list[Any] """Get system prompt formatted for the model.""" async def get_response(messages: list[Any]) -> AgentResponse """Get model response including tool calls.""" async def format_blocks(blocks: list[ContentBlock]) -> list[Any] """Format content blocks into model messages.""" async def format_tool_results(tool_calls: list[MCPToolCall], tool_results: list[MCPToolResult]) -> list[Any] """Format tool results for the model.""" ``` **Class Variables:** * `metadata: dict[str, Any]` - Metadata injected into MCP initialize request **Auto-Client Creation:** If no `mcp_client` is provided but a `Task` with `mcp_config` is passed to `run()`, an MCPClient is automatically created and cleaned up. ## Pre-built Agents ### ClaudeAgent ```python from hud.agents import ClaudeAgent ``` Claude-specific implementation using Anthropic's API. **Constructor Parameters:** | Parameter | Type | Description | Default | | ------------------- | ---------------- | ----------------------- | ------------------------------ | | `model_client` | `AsyncAnthropic` | Anthropic client | Auto-created | | `model` | `str` | Claude model to use | `"claude-3-7-sonnet-20250219"` | | `max_tokens` | `int` | Maximum response tokens | `4096` | | `use_computer_beta` | `bool` | Use computer-use beta | `True` | **Features:** * Native Claude tool calling * Automatic prompt caching * Computer-use beta support * Display metadata injection (1280x720) **Example:** ```python agent = ClaudeAgent( model="claude-3-5-sonnet-20241022", max_tokens=8192 ) result = await agent.run( Task( prompt="Navigate to example.com", mcp_config={"server": {...}}, evaluate_tool={"name": "evaluate", "arguments": {...}} ) ) ``` ### OperatorAgent ```python from hud.agents import OperatorAgent ``` OpenAI's Operator agent implementation. 
**Constructor Parameters:** | Parameter | Type | Description | Default | | -------------- | ------------- | ----------------------- | --------------------------- | | `model_client` | `AsyncOpenAI` | OpenAI client | Auto-created | | `model` | `str` | Model to use | `"gpt-4o-realtime-preview"` | | `max_tokens` | `int` | Maximum response tokens | `4096` | **Features:** * OpenAI function calling * Operator system prompt * Display metadata injection (1920x1080) ### GenericOpenAIChatAgent ```python from hud.agents import GenericOpenAIChatAgent ``` Generic OpenAI chat completion agent for any OpenAI-compatible API. **Constructor Parameters:** | Parameter | Type | Description | Default | | -------------- | ------------- | ------------------------ | -------- | | `model_client` | `AsyncOpenAI` | OpenAI-compatible client | Required | | `model` | `str` | Model name | Required | | `max_tokens` | `int` | Maximum response tokens | `4096` | **Example:** ```python # Use with local LLM from openai import AsyncOpenAI client = AsyncOpenAI( base_url="http://localhost:11434/v1", # Ollama api_key="not-needed" ) agent = GenericOpenAIChatAgent( model_client=client, model="llama3.1" ) ``` ### LangChainAgent ```python from hud.agents import LangChainAgent ``` LangChain integration for using any LangChain-compatible model. **Constructor Parameters:** | Parameter | Type | Description | Default | | --------- | --------------- | -------------------- | -------- | | `model` | `BaseChatModel` | LangChain chat model | Required | **Example:** ```python from langchain_anthropic import ChatAnthropic model = ChatAnthropic(model="claude-3-opus-20240229") agent = LangChainAgent(model=model) ``` ### ArtHUDAgent ```python from hud.agents import ArtHUDAgent ``` Integration with ART (Automatic Reasoning and Tool-use) models. **Constructor Parameters:** | Parameter | Type | Description | Default | | ------------ | ---------- | ----------------------- | -------- | | `art_model` | `ArtModel` | ART model instance | Required | | `max_tokens` | `int` | Maximum response tokens | `4096` | **Features:** * Built-in Chain-of-Thought reasoning * Structured reasoning traces * Tool use optimization ## Helper Classes ### ResponseAgent Base class for auto-response handlers that decide when to continue or stop. 
```python
from hud.agents.misc import ResponseAgent

class MyResponseAgent(ResponseAgent):
    async def determine_response(self, agent_output: str) -> str:
        if "task complete" in agent_output.lower():
            return "STOP"
        return "Continue with the next step"
```

## Common Types

### AgentResponse

```python
from hud.types import AgentResponse

@dataclass
class AgentResponse:
    content: str | None = None
    tool_calls: list[MCPToolCall] | None = None
    done: bool = False
```

### MCPToolCall

```python
from hud.types import MCPToolCall

class MCPToolCall(BaseModel):
    name: str
    arguments: dict[str, Any] = {}
```

### MCPToolResult

```python
from hud.types import MCPToolResult

class MCPToolResult(BaseModel):
    content: list[ContentBlock]
    structuredContent: dict[str, Any] | None = None
    isError: bool = False
```

### Trace

```python
from hud.types import Trace

class Trace(BaseModel):
    reward: float = 0.0
    done: bool = True
    content: str | None = None
    isError: bool = False
    info: dict[str, Any] = {}
    steps: list[TraceStep] = []
```

## Usage Examples

### Simple Prompt Execution

```python
from hud.agents import ClaudeAgent
from hud.clients import MCPClient

# Manual client creation
client = MCPClient({
    "server": {
        "command": "docker",
        "args": ["run", "-i", "my-env:latest"]
    }
})

agent = ClaudeAgent(mcp_client=client)
await agent.initialize()

# Run with string prompt
result = await agent.run("Click the submit button", max_steps=5)
print(f"Result: {result.content}")
print(f"Success: {not result.isError}")
```

### Task Execution with Auto-Client

```python
from hud.agents import OperatorAgent
from hud.datasets import Task

# No client needed - auto-created from task
agent = OperatorAgent()

task = Task(
    prompt="Find the price of the product",
    mcp_config={
        "browser": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },
    setup_tool={
        "name": "setup",
        "arguments": {"url": "https://example.com"}
    },
    evaluate_tool={
        "name": "evaluate",
        "arguments": {"check": "price_found"}
    }
)

# Client created automatically
result = await agent.run(task, max_steps=20)
print(f"Reward: {result.reward}")
```

### Custom Agent Implementation

```python
from mcp.types import ContentBlock

from hud.agents import MCPAgent
from hud.types import AgentResponse, MCPToolCall
import hud

class MyCustomAgent(MCPAgent):
    metadata = {"custom": "metadata"}

    async def get_system_messages(self) -> list[dict]:
        return [{
            "role": "system",
            "content": self.system_prompt
        }]

    @hud.instrument(span_type="agent", record_args=False, record_result=True)
    async def get_response(self, messages: list[dict]) -> AgentResponse:
        # Your LLM call here
        response = await self.llm.chat(messages)

        return AgentResponse(
            content=response.content,
            tool_calls=[
                MCPToolCall(name=tc.name, arguments=tc.args)
                for tc in response.tool_calls
            ],
            done=response.stop_reason == "stop"
        )

    async def format_blocks(self, blocks: list[ContentBlock]) -> list[dict]:
        content = []
        for block in blocks:
            if block.type == "text":
                content.append({"type": "text", "text": block.text})
            elif block.type == "image":
                content.append({
                    "type": "image",
                    "image": {"data": block.data, "format": "png"}
                })
        return [{"role": "user", "content": content}]

    async def format_tool_results(self, tool_calls, tool_results) -> list[dict]:
        return [{
            "role": "tool",
            "content": result.content,
            "tool_call_id": call.name
        } for call, result in zip(tool_calls, tool_results)]
```

## See Also

* [Create Agents](/evaluate-agents/create-agents) - Tutorial on building agents
* [Tasks](/reference/tasks) - Task configuration reference
*
[Architecture](/core-concepts/architecture) - How agents fit in HUD # hud analyze Source: https://docs.hud.so/reference/cli/analyze Inspect MCP environments quickly from metadata or live containers The `hud analyze` command inspects MCP environments to discover their tools and capabilities. By default, it uses cached metadata for instant results. Use `--live` for real-time analysis. ## Usage ```bash hud analyze [OPTIONS] ``` ## Arguments Docker image, lock file path, or command to analyze ## Options Output format: `interactive`, `json`, or `markdown`. Short: `-f` Show full tool schemas and parameters. Short: `-v` Run container for live analysis (slower but more accurate) JSON config file with MCP configuration. Short: `-c` Analyze a server from Cursor config Analyze a local command instead of Docker image Timeout in seconds for live analysis ## Analysis Modes ### Fast Mode (Default) Uses cached metadata from: 1. Local lock file cache (`~/.hud/locks/`) 2. HUD registry (if available) 3. Basic Docker manifest info ```bash # Instant results from metadata hud analyze hudpython/text-analyzer:latest # From lock file hud analyze ./my-env.lock.yaml ``` Fast mode is perfect for quick inspection and doesn't require running containers. ### Live Mode Runs the actual container for comprehensive analysis: ```bash # Full analysis with running container hud analyze hudpython/text-analyzer:latest --live # With environment variables hud analyze my-env:latest --live -e API_KEY=test ``` Use `--live` when you need: * Real-time tool validation * Resource discovery * Telemetry information * Testing with specific env vars ## Output Formats ### Interactive (Default) Fast mode output: ``` 📊 Environment Overview ┌─────────────┬─────────────────────────┐ │ Image │ hudpython/text-2048 │ │ Source │ HUD Registry │ │ Built │ 2024-01-15T10:30:00Z │ │ HUD Version │ 0.1.0 │ │ Init Time │ 450 ms │ │ Tools │ 6 │ └─────────────┴─────────────────────────┘ 🔧 Available Tools └── Tools ├── setup - Initialize environment ├── evaluate - Return environment state ├── move - Move tiles in direction ├── reset_board - Reset game to start └── get_score - Get current score ``` Live mode output (`--live`): ``` 🔍 Analyzing MCP environment: hudpython/text-2048:latest 📊 Environment Overview ┌─────────────┬─────────────────────────┐ │ Server │ hud-text-2048 │ │ Initialized │ ✓ │ └─────────────┴─────────────────────────┘ 🔧 Available Tools ├── Regular Tools │ ├── move │ │ └── Move tiles in the specified direction │ ├── reset_board │ │ └── Reset the game board to initial state │ └── get_score │ └── Get the current game score └── Hub Tools └── game_hub ├── save_state └── load_state 📚 Available Resources ┌──────────────────┬────────────────┬─────────────┐ │ URI │ Name │ Type │ ├──────────────────┼────────────────┼─────────────┤ │ game://state │ Game State │ application │ │ game://history │ Move History │ text/plain │ └──────────────────┴────────────────┴─────────────┘ 📡 Telemetry Data ┌─────────┬─────────────────────────────┐ │ Live URL│ https://app.hud.so/xyz123 │ │ Status │ running │ │ Services│ 2/2 running │ └─────────┴─────────────────────────────┘ ``` ### JSON Format ```bash # Fast mode hud analyze my-env --format json { "image": "my-env:latest", "status": "from_cache", "tool_count": 6, "init_time": 450, "tools": [{ "name": "setup", "description": "Initialize environment" }] } # Live mode hud analyze my-env --format json --live { "metadata": { "servers": ["my-env"], "initialized": true }, "tools": [{ "name": "setup", "description": "Initialize 
environment", "input_schema": {...} }], "hub_tools": {...}, "resources": [...], "telemetry": {...} } ``` ### Markdown Format ```bash hud analyze my-env --format markdown > docs/tools.md ``` Generates formatted documentation with tool descriptions and schemas. ## Examples ### Fast Analysis (Default) ```bash # Quick inspection from metadata hud analyze hudpython/text-analyzer:latest # Analyze multiple environments rapidly for env in browser scraper analyzer; do hud analyze "hudpython/$env:latest" done # From lock file hud analyze ./environments/prod.lock.yaml # With verbose schemas hud analyze my-env:latest --verbose ``` ### Live Analysis ```bash # Full container analysis hud analyze my-env:latest --live # With environment variables hud analyze my-env:latest --live -e API_KEY=test -e DEBUG=true # Custom timeout for slow environments hud analyze heavy-env:latest --live --timeout 60 ``` ### Local Development ```bash # Analyze local command hud analyze --command "python my_server.py" --live # Cursor integration hud analyze --cursor my-dev-server --live # With config file hud analyze my-env:latest --config mcp-config.json ``` ### Documentation Generation ```bash # Generate tool documentation hud analyze my-env:latest --format markdown > docs/tools.md # Compare versions hud analyze my-env:v1.0 --format json > v1.json hud analyze my-env:v2.0 --format json > v2.json diff v1.json v2.json # Extract tool names hud analyze my-env --format json | jq -r '.tools[].name' ``` ## Performance Comparison | Mode | Speed | Use Case | | --------------- | ------------ | --------------------------- | | Fast (default) | \< 1 second | Quick inspection, CI checks | | Live (`--live`) | 5-30 seconds | Full validation, debugging | Use fast mode for rapid iteration during development. Switch to `--live` for final validation. ## CI/CD Integration ### Fast CI Checks ```yaml # GitHub Actions - Quick tool validation - name: Verify tools exist run: | TOOLS=$(hud analyze $IMAGE --format json | jq '.tool_count') if [ "$TOOLS" -lt 3 ]; then echo "Not enough tools!" 
exit 1 fi ``` ### Python Validation Script ```python #!/usr/bin/env python3 import subprocess import json def verify_environment(image, required_tools): """Fast validation using metadata""" # Quick check with metadata result = subprocess.run( ["hud", "analyze", image, "--format", "json"], capture_output=True, text=True, check=True ) data = json.loads(result.stdout) available = {tool["name"] for tool in data["tools"]} missing = set(required_tools) - available if missing: print(f"⚠️ Missing tools: {missing}") print("Running live analysis for details...") # Full check if needed subprocess.run( ["hud", "analyze", image, "--live"], check=True ) return len(missing) == 0 # Usage verify_environment("my-env:latest", ["setup", "evaluate", "process"]) ``` ## Troubleshooting The environment may not have been pushed to HUD registry: ```bash # Falls back to basic Docker info hud analyze my-env:latest --verbose # Force live analysis hud analyze my-env:latest --live ``` If metadata seems outdated: ```bash # Use live mode for current state hud analyze my-env:latest --live # Rebuild and push to update metadata hud build && hud push ``` Some environments take longer to initialize: ```bash # Increase timeout hud analyze heavy-env:latest --live --timeout 120 # Check with debug first hud debug heavy-env:latest --max-phase 1 ``` Ensure clean output: ```bash # Redirect stderr for clean JSON hud analyze my-env --format json 2>/dev/null # Or use jq for validation hud analyze my-env --format json | jq . ``` ## Best Practices 1. **Default to Fast Mode**: Start with metadata for quick checks 2. **Live for Validation**: Use `--live` before production deployments 3. **Cache Lock Files**: Share lock files for consistent metadata 4. **Version Your Tools**: Track tool changes across versions 5. **Automate Checks**: Add fast analysis to CI pipelines Fast mode shows what tools *should* be available based on build-time analysis. Live mode shows what *actually* works right now. ## See Also * [`hud debug`](/reference/cli/debug) - Test tool functionality * [`hud pull`](/reference/cli/pull) - Get environments with metadata * [`hud build`](/reference/cli/build) - Generate fresh metadata * [`hud run`](/reference/cli/run) - Execute environments # hud build Source: https://docs.hud.so/reference/cli/build Build production images and generate lock files for reproducible environments The `hud build` command creates production Docker images and generates `hud.lock.yaml` files that capture all metadata needed for reproducible deployments. ## Usage ```bash hud build [DIRECTORY] [OPTIONS] ``` ## Arguments Environment directory containing Dockerfile and pyproject.toml ## Options Docker image tag. Auto-generates based on directory name if not specified. Short: `-t` Build without using Docker cache Show detailed build output. Short: `-v` ## What It Does The build command performs several critical steps: Runs `docker build` with appropriate tags and build arguments Connects to the built image and analyzes: * Initialization time * Available tools and descriptions * Required/optional environment variables * Protocol capabilities Creates `hud.lock.yaml` with: * Image digest (sha256 hash) * Build metadata * Tool summaries * Environment requirements Adds metadata labels to the Docker image for tracking ## Lock File Structure The generated `hud.lock.yaml` contains: ```yaml version: "1.0" image: "my-env:latest@sha256:abc123..." 
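# the embedded sha256 digest pins the exact image, which is what makes deployments reproducible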
build: generatedAt: "2024-01-15T10:30:00Z" hudVersion: "0.1.0" directory: "my-env" environment: initializeMs: 450 toolCount: 12 variables: required: ["API_KEY", "DATABASE_URL"] optional: ["DEBUG", "LOG_LEVEL"] tools: - name: "setup" description: "Initialize environment" - name: "evaluate" description: "Return environment state" - name: "process_data" description: "Process input data with advanced algorithms..." ``` ## Examples ### Basic Build Build with auto-generated tag: ```bash hud build # Creates: hud-my-env:latest ``` ### Custom Tag Specify version tag: ```bash hud build . --tag v1.2.0 # Creates: hud-my-env:v1.2.0 ``` ### Clean Build Force rebuild without cache: ```bash hud build . --no-cache --verbose ``` ### CI/CD Build Build with specific naming: ```bash hud build . --tag "prod-$(git rev-parse --short HEAD)" # Creates: hud-my-env:prod-abc123f ``` ## Image Naming If no tag is specified via `--tag`, the command: 1. Checks `pyproject.toml` for `[tool.hud.image]` 2. Auto-generates: `hud-{directory-name}:latest` 3. Converts underscores to hyphens for Docker compatibility Examples: * Directory `my_tool` → Image `hud-my-tool:latest` * Directory `browser-env` → Image `hud-browser-env:latest` ## Build Process ### Dockerfile Detection The command expects a `Dockerfile` in the environment directory: ```dockerfile FROM python:3.10-slim RUN pip install uv WORKDIR /app COPY pyproject.toml . RUN uv pip install --system -e . COPY src/ src/ CMD ["python", "-m", "my_package.server"] ``` ### Environment Variables Automatically extracts from Dockerfile: * `ENV` declarations → Optional variables * `ARG` with no default → Required variables ```dockerfile # Detected as required ARG API_KEY # Detected as optional ENV DEBUG=false ENV LOG_LEVEL=info ``` ### Build Arguments Passes through Docker build arguments: ```bash hud build . --build-arg API_VERSION=v2 ``` ## Lock File Usage The generated `hud.lock.yaml` enables: Exact image reference with digest ensures identical deployment Tool information available without running container Metadata for `hud push` to share environments Version tracking for debugging and updates ## Integration ### With hud push After building, share your environment: ```bash hud build . --tag v1.0.0 hud push ``` ### With CI/CD ```yaml # GitHub Actions example - name: Build HUD environment run: | hud build . --tag "${{ github.sha }}" - name: Verify build run: | hud debug "hud-my-env:${{ github.sha }}" - name: Push to registry run: | hud push ``` ### Version Management ```bash # Semantic versioning hud build . --tag v1.0.0 hud build . --tag v1.1.0-beta # Date-based hud build . --tag "$(date +%Y%m%d)" # Git-based hud build . --tag "$(git describe --tags --always)" ``` ## Troubleshooting ### Build Failures Check Dockerfile syntax and ensure all files are present: ```bash docker build . --no-cache ``` The environment may be taking too long to initialize: ```bash hud build . --verbose # See detailed logs ``` Ensure the environment implements required tools: * `setup()` - Must exist * `evaluate()` - Must return a dictionary ### Performance Tips Structure your Dockerfile to maximize layer caching: 1. Install system dependencies first 2. Copy dependency files (pyproject.toml) 3. Install Python dependencies 4. Copy source code last Using `--no-cache` significantly increases build time. Only use when necessary. ## Best Practices 1. **Tag Consistently**: Use semantic versioning or git-based tags 2. **Cache Wisely**: Structure Dockerfile for optimal caching 3. 
**Test Builds**: Always run `hud debug` after building 4. **Document Variables**: Clearly indicate required vs optional env vars 5. **Keep Images Small**: Use slim base images and multi-stage builds ## See Also * [`hud init`](/reference/cli/init) - Create new environments * [`hud dev`](/reference/cli/dev) - Development with hot-reload * [`hud push`](/reference/cli/push) - Share to registry * [`hud analyze`](/reference/cli/analyze) - Inspect built images * [Build Environments](/build-environments) - Complete guide # hud debug Source: https://docs.hud.so/reference/cli/debug Test MCP environments through 5 validation phases The `hud debug` command validates MCP environments through 5 progressive phases. ## Synopsis ```bash hud debug [OPTIONS] IMAGE [DOCKER_ARGS...] hud debug --command COMMAND [ARGS...] hud debug --cursor SERVER_NAME ``` ## Options | Option | Description | Default | | ------------- | --------------------------------- | ------- | | `--max-phase` | Stop at specified phase (1-5) | 5 | | `--command` | Debug a command instead of Docker | - | | `--cursor` | Debug a Cursor MCP server | - | | `--timeout` | Timeout per phase in seconds | 30 | | `-e, --env` | Set environment variables | - | ## Debug Phases ### Phase 1: Container Startup ``` 🚀 Phase 1: Container startup ✓ Container started successfully ✓ Received output on stderr ``` Checks: Process starts, outputs to stderr, doesn't exit ### Phase 2: MCP Initialization ``` 🔗 Phase 2: MCP initialization ✓ Received valid initialize response ✓ Protocol version: 1.0 ``` Checks: JSON-RPC communication, protocol compatibility ### Phase 3: Tool Discovery ``` 🔧 Phase 3: Tool discovery ✓ Found 8 tools: • setup_board - Initialize game board • move - Move in direction ``` Checks: Tools registered, valid schemas ### Phase 4: Tool Execution ``` 🧪 Phase 4: Tool testing → Testing tool: setup_board ✓ Tool executed successfully ``` Checks: Tools callable, return valid responses ### Phase 5: Readiness Check ``` ✅ Phase 5: Readiness check ✓ Ready for agent interaction ``` Checks: Overall health, state persistence Once your environment passes Phase 5, consider using `hud dev` for hot-reload development. This eliminates rebuild cycles and speeds up iteration. ## Examples ### Docker Debug ```bash # Basic hud debug my-server:latest # With environment variables hud debug my-server:latest -e DEBUG=true -e LOG_LEVEL=debug # Stop at phase 3 hud debug my-server:latest --max-phase 3 ``` ### Local Development ```bash # Python script hud debug --command "python my_server.py" # With arguments hud debug --command "python -m my_module.server --port 8000" ``` ### Cursor Integration ```bash # Debug Cursor server hud debug --cursor my-dev-env ``` ## Common Issues ### Phase 1 Failures * **Container exits**: Check CMD/ENTRYPOINT * **No stderr**: Add `print(..., file=sys.stderr)` * **Permission denied**: Check file permissions ### Phase 2 Failures * **Invalid JSON**: Only JSON-RPC on stdout * **Timeout**: Check for blocking operations ### Phase 3 Failures * **No tools**: Check `@server.tool()` decorators * **Invalid schema**: Add type hints and docstrings ## Advanced Usage ### Incremental Debugging ```bash # Debug phase by phase for phase in 1 2 3 4 5; do echo "Testing phase $phase..." hud debug my-env:latest --max-phase $phase || break done ``` ### Development Mode ```bash # Once debug passes, use hot-reload for development cd environments/my-env hud dev . 
--build # Make changes to src/ files # Test changes without rebuilding hud analyze hud-my-env:dev ``` ### Parallel Testing ```bash # Test multiple environments images=("env1:latest" "env2:latest") for image in "${images[@]}"; do hud debug "$image" > "debug_${image//:/}.log" 2>&1 & done wait ``` ## Best Practices 1. Start with phase 1 and work up 2. Use `--max-phase` for incremental debugging 3. Check stderr output first 4. Keep tools simple for easier debugging 5. Add health check tools If debug passes, agents should work reliably with your environment. ## Next Step Explore environment capabilities after debugging # hud dev Source: https://docs.hud.so/reference/cli/dev Hot-reload development server for MCP environments The `hud dev` command provides hot-reload development for MCP environments, automatically restarting your server when source files change. ## Synopsis ```bash hud dev [DIRECTORY] [OPTIONS] ``` ## Description `hud dev` creates a development proxy server that runs your MCP environment in Docker with hot-reload capabilities. It automatically detects or builds your Docker image and exposes it via HTTP or stdio for development. **Key Features:** * **Auto-detection**: Automatically finds or generates Docker image names * **Hot-reload**: Mounts local `src/` directory for live code changes * **Interactive testing**: Test tools directly from the command line with arrow keys * **HTTP/Stdio protocols**: Choose transport method * **Inspector support**: Launch MCP Inspector for debugging * **Container management**: Handles Docker lifecycle automatically ## Arguments | Argument | Description | Default | | ----------- | -------------------------------- | ------------- | | `DIRECTORY` | Environment directory (optional) | `.` (current) | ## Options | Option | Description | Default | | ------------------- | -------------------------------------------- | ------------- | | `--image`, `-i` | Docker image name (overrides auto-detection) | Auto-detected | | `--build`, `-b` | Build image before starting | `false` | | `--no-cache` | Force rebuild without cache | `false` | | `--transport`, `-t` | Transport protocol: `http` or `stdio` | `http` | | `--port`, `-p` | HTTP server port (ignored for stdio) | `8765` | | `--interactive` | Launch interactive tool testing interface | `false` | | `--no-reload` | Disable hot-reload functionality | `false` | | `--verbose`, `-v` | Show detailed server logs | `false` | | `--inspector` | Launch MCP Inspector (HTTP mode only) | `false` | | `--no-logs` | Disable streaming Docker logs | `false` | ## Examples ### Auto-Detection Mode (Recommended) Start development server without specifying directory (auto-detects from current location): ```bash hud dev ``` ### Build and Start Build Docker image first, then start development server: ```bash hud dev --build ``` ### Specific Directory Watch a specific environment directory: ```bash hud dev environments/my-browser-env --build ``` ### Custom Image Use a specific Docker image name: ```bash hud dev . --image my-custom-env:dev --build ``` ### HTTP Mode with Inspector Start in HTTP mode and launch MCP Inspector: ```bash hud dev . --build --transport http --inspector ``` ### Stdio Mode Use stdio transport (for multiple simultaneous connections): ```bash hud dev . --build --transport stdio ``` ### Clean Rebuild Force a complete rebuild without Docker cache: ```bash hud dev . --build --no-cache ``` ### Verbose Logging Show detailed logs from the container: ```bash hud dev . 
--build --verbose ``` ### Interactive Testing Launch with interactive tool testing interface: ```bash hud dev . --build --interactive ``` Interactive mode disables Docker logs and hot-reload to provide a clean interface for testing tools with arrow key navigation. ## How It Works 1. **Image Resolution**: * Checks `pyproject.toml` for `[tool.hud.image]` setting * Auto-generates name like `hud-{directory-name}:dev` if not found * Uses `--image` override if provided 2. **Docker Setup**: * Builds image with `--build` if requested * Mounts local `src/` directory to `/app/src` in container * Injects hot-reload supervisor (unless `--no-reload`) 3. **Proxy Server**: * Creates FastMCP proxy that forwards to Docker containers * Each client connection gets its own container instance * Automatically handles container lifecycle 4. **Development Flow**: * Edit files in `src/` directory * Changes are immediately reflected in running containers * No rebuild required for code changes * **Only the MCP server restarts**, not your environment state ### Process Separation Architecture For stateful environments, `hud dev` supports a critical design pattern: separating the MCP server from the environment process. This separation enables hot-reload without losing state. **Why Separation Matters:** * MCP server can restart instantly for code changes * Environment state (browsers, databases, games) persists * Development is faster without constant state resets **Example Architecture:** ``` ┌─────────────────┐ ┌──────────────────────┐ │ MCP Server │────▶│ Environment Process │ │ (Restartable) │ │ (Persistent) │ │ │ │ │ │ - Lightweight │ │ - Maintains state │ │ - Fast startup │ │ - Long-lived │ │ - MCP protocol │ │ - Resource heavy │ └─────────────────┘ └──────────────────────┘ ▲ │ │ │ └─── Unix Socket ─────────┘ TCP, gRPC, etc. ``` **Connection Methods:** * Unix sockets (recommended for local dev) * TCP/HTTP endpoints * gRPC services * Shared memory/IPC See the browser environment for a complete implementation example. ## File Mounting The development server mounts your local `src/` directory directly into containers: ``` Local: ./src/ → Container: /app/src/ Local: ./src/my_package/ → Container: /app/src/my_package/ Local: ./src/server.py → Container: /app/src/server.py ``` **Live Changes**: Any edits to files in `src/` are immediately reflected in running containers. **Hot-Reload Behavior**: * With process separation: Only the MCP server restarts, environment state persists * Without separation: Entire environment restarts (state is lost) * Disable with `--no-reload` if manual control is preferred For environments with important state (browsers, databases, running applications), implement process separation to preserve state during hot-reload. This dramatically improves development speed. ## Interactive Mode When launched with `--interactive`, `hud dev` provides a command-line interface for testing tools: ```bash hud dev . --interactive ``` **Features:** * **Arrow Key Navigation**: Select tools using up/down arrows * **Parameter Input**: Enter tool parameters with validation * **Result Display**: See tool responses in real-time * **Clean Interface**: Disables logs for focused testing * **Exit Gracefully**: Select "Quit" or press Ctrl+C **Workflow:** 1. Environment starts and connects 2. Tools list appears with arrow navigation 3. Select a tool and press Enter 4. Input parameters if required 5. See results immediately 6. 
Continue testing or quit Interactive mode is perfect for rapid tool development - make changes to your code and test immediately without leaving the terminal. ## Transport Modes ### HTTP Transport (Default) ```bash hud dev . --transport http --port 8080 ``` **Benefits:** * Web browser access * MCP Inspector integration * Multiple simultaneous connections * RESTful debugging **URL:** `http://localhost:8080/mcp` ### Stdio Transport ```bash hud dev . --transport stdio ``` **Benefits:** * Direct MCP protocol * Lower latency * Cursor/Claude integration ready * Single connection focused **Usage:** Connect via stdio to the running container ## Integration Examples ### Cursor Integration 1. Start development server: ```bash hud dev . --build --transport http --port 8765 ``` 2. Add to Cursor's MCP config: ```json { "mcpServers": { "my-dev-env": { "url": "http://localhost:8765/mcp" } } } ``` 3. Edit files - changes apply immediately! ### Testing During Development Run analysis in another terminal: ```bash # Test tools are working hud analyze hud-my-env:dev # Full debug check hud debug hud-my-env:dev ``` ### MCP Inspector Launch inspector for visual debugging: ```bash hud dev . --build --inspector ``` Opens browser with MCP Inspector interface showing: * Available tools and schemas * Real-time tool calls * Protocol messages * Error debugging ## Troubleshooting ### Port Already in Use ```bash # Use different port hud dev . --port 9000 # Or find and kill existing process lsof -ti :8765 | xargs kill ``` ### Build Failures ```bash # Clean rebuild hud dev . --build --no-cache # Check Dockerfile docker build -t my-env . ``` ### File Changes Not Detected ```bash # Check file permissions ls -la src/ # Force restart (Ctrl+C then restart) hud dev . --build --verbose ``` ### Container Startup Issues ```bash # Debug with full logs hud dev . --build --verbose # Test image manually docker run -it my-env:dev ``` ## Performance Tips 1. **Exclude Large Files**: Use `.dockerignore` to exclude large files from build context 2. **Layer Caching**: Structure Dockerfile for optimal layer caching 3. **Minimal Rebuilds**: Only use `--build` when dependencies change 4. **Port Selection**: Use uncommon ports to avoid conflicts ## Development Workflow Recommended workflow for environment development: ```bash # 1. Navigate to your environment directory cd environments/my-env # 2. Start development server with interactive testing hud dev --build --interactive # 3. Test tools interactively with arrow keys # Make code changes and test again immediately # 4. Alternative: Run dev server and test separately # Terminal 1: Keep hud dev running hud dev --build # Terminal 2: Test your changes hud analyze hud-my-env:dev # 5. Production build hud build # Creates hud.lock.yaml # 6. Share your environment hud push ``` Using `hud dev --build` also generates or updates `hud.lock.yaml` with environment metadata, similar to `hud build`. 
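The process-separation pattern described earlier pairs naturally with the SDK's context helpers (the same ones the `hud init` template below uses). A minimal sketch, assuming the template's default socket path and an illustrative `BrowserState` class:

```python
# context.py - the persistent environment process (survives MCP server restarts)
import asyncio

from hud.server.context import run_context_server


class BrowserState:
    """Illustrative long-lived state: keeps counting while the MCP server hot-reloads."""

    def __init__(self):
        self.pages_visited = 0

    def visit(self, url: str) -> int:
        self.pages_visited += 1
        return self.pages_visited


if __name__ == "__main__":
    # Serve BrowserState over a unix socket so a restartable MCP server can attach
    asyncio.run(run_context_server(BrowserState()))
```

The restartable MCP server then reconnects with `attach_context("/tmp/hud_ctx.sock")` on each reload, as in the `hud init` template, so `hud dev` can restart the server without losing `pages_visited`.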
## See Also * [`hud init`](/reference/cli/init) - Create new environments * [`hud build`](/reference/cli/build) - Build production images * [`hud push`](/reference/cli/push) - Share to registry * [`hud analyze`](/reference/cli/analyze) - Test tools after changes * [`hud debug`](/reference/cli/debug) - Full environment validation * [`hud run`](/reference/cli/run) - Run production environments * [Build Environments](/build-environments) - Environment development guide # hud init Source: https://docs.hud.so/reference/cli/init Create a new HUD environment with minimal boilerplate The `hud init` command creates a new HUD environment from scratch with all the essential files needed to build, test, and deploy an MCP server. ## Usage ```bash hud init [NAME] [OPTIONS] ``` ## Arguments Environment name. If not provided, uses the current directory name. ## Options Target directory where the environment will be created. Short: `-d` Overwrite existing files if they exist. Short: `-f` ## What It Creates The command generates a minimal but complete HUD environment structure: ``` my-env/ ├── Dockerfile # Container configuration ├── pyproject.toml # Dependencies and metadata ├── README.md # Basic documentation └── src/ └── hud_controller/ ├── context.py # Persistent state management └── server.py # MCP server implementation ``` ### Generated Files Minimal MCP server with required tools: ```python from hud.server import MCPServer from hud.server.context import attach_context mcp = MCPServer(name="my-env") ctx = None @mcp.initialize async def init(init_ctx): global ctx ctx = attach_context("/tmp/hud_ctx.sock") @mcp.tool() async def act() -> str: """Perform an action.""" return f"Action #{ctx.act()}" @mcp.tool() async def setup() -> str: """Required for HUD environments.""" return "Ready" @mcp.tool() async def evaluate() -> dict: """Required for HUD environments.""" return {"count": ctx.get_count()} if __name__ == "__main__": mcp.run() ``` Persistent state that survives hot-reloads: ```python from hud.server.context import run_context_server import asyncio class Context: def __init__(self): self.count = 0 def act(self): self.count += 1 return self.count def get_count(self): return self.count if __name__ == "__main__": asyncio.run(run_context_server(Context())) ``` Production-ready container setup: ```dockerfile FROM python:3.10-slim RUN pip install uv WORKDIR /app COPY pyproject.toml . RUN uv pip install --system -e . COPY src/ src/ CMD ["python", "-m", "hud_controller.server"] ``` Project configuration with HUD metadata: ```toml [project] name = "my-env" version = "0.1.0" dependencies = ["hud[sdk]"] [tool.hud] image = "hud-my-env:dev" ``` ## Examples ### Basic Usage Create an environment in a new directory: ```bash hud init my-env cd my-env ``` ### Current Directory Initialize in the current directory using its name: ```bash mkdir awesome-tool && cd awesome-tool hud init ``` ### Custom Location Create in a specific directory: ```bash hud init my-env --dir ~/projects cd ~/projects/my-env ``` ### Overwrite Existing Force overwrite existing files: ```bash hud init my-env --force ``` ## Next Steps After creating your environment: Run with hot-reload and interactive testing: ```bash hud dev . 
--interactive ``` Edit `src/hud_controller/server.py` to add custom tools: ```python @mcp.tool() async def my_custom_tool(text: str) -> str: """Process some text.""" return text.upper() ``` In interactive mode, use arrow keys to select and test tools. When ready, build and share: ```bash hud build hud push ``` ## Key Concepts ### Required Tools Every HUD environment must implement: * `setup()` - Initialize the environment * `evaluate()` - Return environment state for evaluation ### Context Pattern The context pattern separates state from server logic: * State persists across hot-reloads * Server reconnects to existing context * Enables rapid development iteration ### Minimal Dependencies The template uses only essential dependencies: * `hud[sdk]` - Core HUD functionality * No unnecessary packages * Production-ready from the start ## Tips The generated environment is immediately runnable with `hud dev`. No additional setup required! Environment names are converted to valid Docker image names automatically (underscores → hyphens). Use `--force` carefully as it will overwrite existing files without confirmation. ## See Also * [hud dev](/reference/cli/dev) - Start development with hot-reload * [hud build](/reference/cli/build) - Build production images * [Build Environments Guide](/build-environments) - Complete environment development guide # CLI Overview Source: https://docs.hud.so/reference/cli/overview Complete reference for HUD command-line tools The HUD CLI provides a complete toolkit for creating, developing, and running MCP environments. Commands are organized into two main workflows: **Directory-based commands** for creating and sharing environments: * `hud init` - Create new environment * `hud dev` - Develop with hot-reload * `hud build` - Build and generate lock file * `hud push` - Share to registry **Target-based commands** for consuming environments: * `hud pull` - Get from registry * `hud analyze` - Inspect capabilities * `hud debug` - Test compliance * `hud run` - Execute locally/remotely ## Installation **uv (Recommended):** ```bash uv tool install hud-python hud --version ``` **pip:** ```bash pip install hud-python python -m hud.cli --help ``` **pipx:** ```bash pipx install hud-python hud --version ``` ## Commands ### Building Workflow | Command | Input | Description | Example | | ----------- | --------- | ----------------------- | ------------------------- | | `hud init` | Directory | Create new environment | `hud init my-env` | | `hud dev` | Directory | Hot-reload development | `hud dev . --interactive` | | `hud build` | Directory | Build image & lock file | `hud build . --tag v1.0` | | `hud push` | Directory | Share to registry | `hud push .
--tag prod` | ### Running Workflow | Command | Input | Description | Example | | ------------- | ------ | ----------------------------- | ------------------------- | | `hud pull` | Target | Get environment from registry | `hud pull org/env:latest` | | `hud analyze` | Target | Inspect tools & capabilities | `hud analyze org/env` | | `hud debug` | Target | Test MCP compliance | `hud debug my-env:latest` | | `hud run` | Target | Execute server | `hud run my-env --local` | ### Utilities | Command | Description | Example | | ----------------- | --------------------------- | ----------------- | | `hud quickstart` | Clone quickstart repository | `hud quickstart` | | `hud cursor-list` | List Cursor MCP servers | `hud cursor-list` | | `hud version` | Show version info | `hud version` | ## Complete Workflows ### Building an Environment Create a new HUD environment with minimal boilerplate: ```bash hud init my-env && cd my-env ``` Creates `Dockerfile`, `pyproject.toml`, and minimal MCP server with context. Run with hot-reload and interactive testing: ```bash hud dev ``` Your changes reload automatically. Test tools interactively with arrow keys. Create production image and lock file: ```bash hud build ``` Generates `hud.lock.yaml` with metadata and labels image for reproducibility. Share to Docker and HUD registries: ```bash hud push ``` Requires `HUD_API_KEY`. Auto-detects registry from Docker login. ### Running an Environment Get environment with metadata preview: ```bash hud pull hudpython/text_init:latest ``` Shows tools, env vars, and build info before downloading. Quick inspection without running: ```bash hud analyze hudpython/text_init # Fast (from metadata) hud analyze hudpython/text_init --live # Full (runs container) ``` Test MCP protocol compliance: ```bash hud debug hudpython/text_init:latest ``` Validates through 5 phases of initialization. Execute in production mode: ```bash hud run hudpython/text_init:latest # Remote (default) hud run hudpython/text_init:latest --local # Local Docker ``` ## Common Usage ### Docker Images ```bash # Basic hud debug my-image:latest # With options hud debug my-image:latest -e DEBUG=true -p 8080:8080 # From registry hud analyze ghcr.io/org/image:v1.0.0 ``` ### Local Commands ```bash # Python script hud debug --command "python server.py" # Node.js server hud debug --command "node mcp-server.js" ``` ### Cursor Integration ```bash # Debug Cursor server hud debug --cursor my-dev-server # List all servers hud cursor-list ``` ## Output Formats ### Interactive (Default) ```bash hud analyze my-env 🔍 Analyzing MCP environment: my-env ✓ Connected successfully ✓ Found 12 tools Available Tools: • click - Click at coordinates • type - Type text ... ``` ### JSON ```bash hud analyze my-env --format json { "tools": [{ "name": "click", "description": "Click at coordinates", "parameters": {...} }] } ``` ### Markdown ```bash hud analyze my-env --format markdown > docs/tools.md ``` ## CI/CD Example ```bash #!/bin/bash set -e # Test environment hud debug "$IMAGE_NAME" # Verify tools TOOLS=$(hud analyze "$IMAGE_NAME" --format json | jq '.tools | length') if [ "$TOOLS" -lt 3 ]; then echo "Not enough tools!" 
exit 1 fi ``` ## Python Scripting ```python import subprocess import json def get_tools(image): result = subprocess.run( ["hud", "analyze", image, "--format", "json"], capture_output=True, text=True, check=True ) return json.loads(result.stdout)["tools"] # Use tools = get_tools("my-env:latest") for tool in tools: print(f"- {tool['name']}: {tool['description']}") ``` ## Exit Codes | Code | Meaning | Description | | ---- | ---------------- | ------------------- | | 0 | Success | Command completed | | 1 | General Error | Command failed | | 2 | Usage Error | Invalid arguments | | 3 | Connection Error | Failed to connect | | 4 | Timeout | Operation timed out | | 5 | Protocol Error | MCP violation | ## Environment Variables ```bash # Debug output export HUD_CLI_DEBUG=true # Custom timeout export HUD_CLI_TIMEOUT=120 # Provider keys export ANCHOR_API_KEY=... ``` ## Next Steps ### Building Commands * [`hud init`](/reference/cli/init) - Create new environments from scratch * [`hud dev`](/reference/cli/dev) - Develop with hot-reload and interactive testing * [`hud build`](/reference/cli/build) - Build images and generate lock files * [`hud push`](/reference/cli/push) - Share environments to registry ### Running Commands * [`hud pull`](/reference/cli/pull) - Get environments with metadata preview * [`hud analyze`](/reference/cli/analyze) - Inspect tools and capabilities * [`hud debug`](/reference/cli/debug) - Test MCP protocol compliance * [`hud run`](/reference/cli/run) - Execute servers locally or remotely # hud pull Source: https://docs.hud.so/reference/cli/pull Get HUD environments from registry with metadata preview The `hud pull` command fetches HUD environments from registries, showing a preview of tools and capabilities before downloading the Docker image. ## Usage ```bash hud pull TARGET [OPTIONS] ``` ## Arguments `TARGET` - Image reference (e.g., `org/env:tag`) or path to lock file ## Options `--lock` - Path to save lock file (if target is image ref). Short: `-l` `--yes` - Skip confirmation prompt. Short: `-y` `--verify-only` - Only verify metadata without pulling image `--verbose` - Show detailed output. Short: `-v` ## How It Works 1. Retrieves environment information from: * HUD registry (if available) * Docker manifest (fallback) * Local lock file (if provided) 2. Shows environment details: * Available tools and descriptions * Build information * Required environment variables * Image size 3. Prompts for confirmation before pulling (unless `--yes`) 4. Downloads the Docker image using `docker pull` 5. Stores metadata locally for offline access ## Examples ### Basic Pull Pull from HUD registry: ```bash hud pull hudpython/text-analyzer:latest ``` Output: ``` 🚀 HUD Environment Pull Fetching metadata for: hudpython/text-analyzer:latest 📦 Environment Details ┌─────────────┬──────────────────────────┐ │ Image │ hudpython/text-analyzer │ │ Source │ HUD Registry │ │ Built │ 2024-01-15T10:30:00Z │ │ HUD Version │ 0.1.0 │ │ Size │ 156 MB │ │ Tools │ 8 │ └─────────────┴──────────────────────────┘ 🔧 Available Tools ├── setup - Initialize environment ├── evaluate - Return environment state ├── analyze_text - Analyze text sentiment ├── extract_entities - Extract named entities └── ... 4 more tools 🔑 Environment Variables ├── Required: OPENAI_API_KEY └── Optional: DEBUG, LOG_LEVEL Pull hudpython/text-analyzer:latest? [y/N]: y ``` ### Skip Confirmation Auto-pull for scripts: ```bash hud pull myorg/myenv:v1.0.0 --yes ``` ### Verify Only Check metadata without downloading: ```bash hud pull myorg/myenv:latest --verify-only ``` ### From Lock File Pull using a shared lock file: ```bash # Someone shares their hud.lock.yaml hud pull ./shared-env.lock.yaml ``` ### Custom Lock Location Save lock file to specific path: ```bash hud pull myorg/myenv:latest --lock ~/.hud/locks/myenv.yaml ``` ## Metadata Sources The command tries these sources in order: 1.
**HUD Registry** (Fastest) * Full tool descriptions * Build metadata * Environment requirements 2. **Docker Registry** (Limited) * Basic image info * Size and layers * No tool details 3. **Lock File** (Complete) * Full environment snapshot * Reproducible reference * Offline capable ## Lock File Storage Downloaded lock files are saved to: ``` ~/.hud/locks/{org}_{name}_{tag}.yaml ``` Examples: * `hudpython_text-analyzer_latest.yaml` * `myorg_browser-env_v1.0.0.yaml` Lock files are cached locally, so repeated `hud analyze` commands are instant. ## Next Steps After pulling, the command shows: ``` ✅ Image pulled successfully ✅ Lock file saved to: ~/.hud/locks/hudpython_text-analyzer_latest.yaml Next Steps: 1. Quick analysis: hud analyze hudpython/text-analyzer 2. Live analysis: hud analyze hudpython/text-analyzer --live 3. Run locally: hud run hudpython/text-analyzer:latest ``` ## Registry Access ### Public Environments No authentication needed: ```bash # HUD's public environments hud pull hudpython/code-analyzer:latest hud pull hudpython/web-browser:latest ``` ### Private Registries Requires Docker login: ```bash # Login first docker login private-registry.com # Then pull hud pull private-registry.com/team/tool:latest ``` ### API Key Benefits With `HUD_API_KEY` set: * Access private HUD metadata * Faster metadata retrieval * Additional environment details ```bash export HUD_API_KEY="your-key" hud pull myorg/private-env:latest ``` ## Troubleshooting ### Common Issues The environment may not have been pushed to HUD registry: * Metadata will be limited to Docker manifest info * Ask the author to run `hud push` * Use `--verbose` to see fallback behavior Check Docker authentication: ```bash # Verify login docker login # Test manual pull docker pull myorg/myenv:latest ``` HUD registry may be slow or unavailable: ```bash # Skip HUD registry, use Docker only hud pull myorg/myenv:latest --verbose ``` ### Network Issues Behind a proxy? Configure Docker: ```bash export HTTP_PROXY=http://proxy:8080 export HTTPS_PROXY=http://proxy:8080 ``` ## Integration Examples ### CI/CD Pipeline ```yaml # GitHub Actions - name: Pull HUD environment run: | hud pull myorg/test-env:latest --yes hud analyze myorg/test-env:latest ``` ### Docker Compose ```yaml services: mcp-env: image: myorg/myenv:latest environment: - API_KEY=${API_KEY} ``` ### Scripted Setup ```bash #!/bin/bash # Pull multiple environments ENVS=("web-browser" "code-analyzer" "data-processor") for env in "${ENVS[@]}"; do echo "Pulling $env..." hud pull "hudpython/$env:latest" --yes done ``` ## Best Practices 1. **Version Pinning**: Use specific tags ```bash hud pull myorg/tool:v1.2.3 # Good hud pull myorg/tool:latest # Risky ``` 2. **Cache Lock Files**: Share lock files for reproducibility ```bash # Save to project hud pull myorg/env:v1 --lock ./env.lock.yaml git add env.lock.yaml ``` 3. **Verify First**: Check before downloading large images ```bash hud pull myorg/large-env:latest --verify-only ``` 4. 
**Document Requirements**: Note any required env vars ```bash # After pull, check requirements hud analyze myorg/env:latest ``` ## See Also * [`hud push`](/reference/cli/push) - Share environments * [`hud analyze`](/reference/cli/analyze) - Inspect pulled environments * [`hud run`](/reference/cli/run) - Execute pulled environments * [`hud debug`](/reference/cli/debug) - Test pulled environments * [Build Environments](/build-environments/registry) - Registry guide # hud push Source: https://docs.hud.so/reference/cli/push Share HUD environments to Docker and HUD registries The `hud push` command publishes your HUD environment to both Docker registry and HUD registry, making it available for others to use with `hud pull`. ## Usage ```bash hud push [DIRECTORY] [OPTIONS] ``` ## Arguments `DIRECTORY` - Environment directory containing hud.lock.yaml ## Options `--image` - Override registry image name (e.g., `myorg/myenv`). Short: `-i` `--tag` - Override tag (e.g., `v1.0`, `latest`). Short: `-t` `--sign` - Sign the image with cosign (not yet implemented) `--yes` - Skip confirmation prompts. Short: `-y` `--verbose` - Show detailed output. Short: `-v` ## Prerequisites `hud push` requires an API key. Set it via environment variable: ```bash export HUD_API_KEY="your-api-key" ``` You must be logged in to a Docker registry: ```bash docker login # or docker login ghcr.io ``` ## What It Does 1. Reads `hud.lock.yaml` to get image reference and metadata 2. Creates registry tag from local image: * Uses `--image` if provided * Auto-detects from Docker config * Defaults to Docker Hub format 3. Pushes image to Docker registry and gets digest 4. Updates `hud.lock.yaml` with: * Registry image reference with digest * Push timestamp * Registry information 5. Sends lock file to HUD registry at: `{HUD_API_URL}/registry/envs/{org}/{name}:{tag}` ## Registry Detection The command intelligently detects your registry: 1. **Explicit**: Use `--image` flag ```bash hud push . --image ghcr.io/myorg/myenv ``` 2. **Docker Config**: Reads from `~/.docker/config.json` ```bash docker login ghcr.io hud push # Auto-uses ghcr.io ``` 3. **Default**: Docker Hub format ```bash hud push # Creates: yourusername/envname ``` ## Examples ### Basic Push Push to default registry: ```bash # After hud build hud push ``` Output: ``` 🚀 HUD Environment Push 📋 Local Details Image: hud-my-env:latest Lock: hud.lock.yaml (2.3 KB) Tools: 5 tools available 🎯 Registry Target Local: hud-my-env:latest Registry: myusername/my-env:latest Push to registry? [y/N]: y ✅ Tagged image ✅ Pushed to registry: myusername/my-env@sha256:abc123... ✅ Updated lock file ✅ Metadata uploaded to HUD registry ``` ### Custom Registry Push to GitHub Container Registry: ```bash hud push . --image ghcr.io/myorg/my-tool --tag v1.0.0 ``` ### CI/CD Automation Skip prompts for automation: ```bash hud push . --yes --tag "prod-$(date +%Y%m%d)" ``` ### Override Tag Push with different tag: ```bash hud push . --tag experimental ``` ## Lock File Updates After pushing, `hud.lock.yaml` gains registry information: ```yaml version: "1.0" image: "myorg/myenv:latest@sha256:abc123..." build: generatedAt: "2024-01-15T10:30:00Z" hudVersion: "0.1.0" directory: "my-env" push: source: "hud-my-env:latest" pushedAt: "2024-01-15T11:00:00Z" registry: "ghcr.io" environment: # ... environment details tools: # ...
tool information ``` ## Sharing Workflows ### Public Sharing After pushing, others can use: ```bash # Pull with metadata preview hud pull myorg/myenv:latest # Use in MCP configs {"hud": {"registry": "myorg/myenv:latest"}} # Direct Docker usage docker run myorg/myenv:latest ``` ### Private Registries For private registries, users need: 1. Docker registry access: ```bash docker login private-registry.com ``` 2. HUD API key (if using private HUD registry): ```bash export HUD_API_KEY="shared-key" ``` ## Usage Messages After successful push, helpful usage examples are shown: ``` 📦 Usage: Test locally: hud run myorg/myenv:latest Use in MCP configs: Claude Desktop: {"docker": {"image": "myorg/myenv@sha256:...", "command": "auto"}} Via HUD (recommended): {"hud": {"registry": "myorg/myenv:latest"}} Share environment: Others can now pull with: hud pull myorg/myenv:latest ``` ## Troubleshooting ### Authentication Issues ```bash # Docker Hub docker login # GitHub Registry docker login ghcr.io -u USERNAME # Custom registry docker login my-registry.com ``` ```bash # Set API key export HUD_API_KEY="your-key-here" # Or create .env file echo "HUD_API_KEY=your-key-here" > .env ``` Ensure you have push access to the registry namespace: * Docker Hub: Must own the namespace * GitHub: Must have package write permissions * Private: Check with your admin ### Push Failures For large images, push may timeout. Use `--verbose` to see progress: ```bash hud push . --verbose ``` If metadata upload fails, the Docker image is still pushed. You can retry metadata upload or share `hud.lock.yaml` manually. ## Best Practices 1. **Version Tags**: Use semantic versioning ```bash hud push . --tag v1.0.0 hud push . --tag v1.1.0-beta ``` 2. **Namespace Organization**: Group related environments ```bash hud push . --image myorg/tools/browser hud push . --image myorg/tools/scraper ``` 3. **Documentation**: Update README with pull instructions ````markdown ## Installation ```bash hud pull myorg/myenv:latest ``` ```` 4. **CI Integration**: Automate pushes ```yaml - name: Push HUD environment env: HUD_API_KEY: ${{ secrets.HUD_API_KEY }} run: hud push . --yes --tag ${{ github.ref_name }} ``` ## Security The HUD registry only stores metadata (lock files), not code or images. Actual images are stored in Docker registries with their existing security models. * **API Keys**: Keep `HUD_API_KEY` secret * **Registry Auth**: Use Docker's credential helpers * **Image Signing**: Coming soon with `--sign` flag * **Private Envs**: Use private Docker registries ## See Also * [`hud build`](/reference/cli/build) - Build before pushing * [`hud pull`](/reference/cli/pull) - Get shared environments * [`hud analyze`](/reference/cli/analyze) - Inspect pushed environments * [Build Environments](/build-environments/registry) - Registry guide # hud run Source: https://docs.hud.so/reference/cli/run Execute MCP environments locally or remotely for production use The `hud run` command executes MCP environments either locally via Docker or remotely via the HUD platform. Unlike `hud dev` which is for development, `hud run` is optimized for production execution. ## Usage ```bash hud run IMAGE [OPTIONS] [-- DOCKER_ARGS...]
``` ## Arguments `IMAGE` - Docker image reference (e.g., `myorg/env:latest`) `DOCKER_ARGS` - Additional Docker arguments after `--` (e.g., `-e KEY=value`) Always use `--` before Docker arguments to avoid conflicts with HUD options: * ✅ `hud run image:tag -- -e KEY=value` * ❌ `hud run image:tag -e KEY=value` ## Options `--local` - Run locally with Docker instead of remote execution `--remote` - Explicitly run remotely (default behavior) `--transport` - Transport protocol: `stdio` or `http`. Short: `-t` `--port` - Port for HTTP transport (ignored for stdio). Short: `-p` `--url` - Remote MCP server URL `--api-key` - API key for remote server (defaults to `HUD_API_KEY` env var) `--run-id` - Run ID for tracking (remote only) `--verbose` - Show detailed output. Short: `-v` ## Remote Execution (Default) By default, `hud run` executes your Docker image on the HUD platform's remote infrastructure. This provides: * **Scalability**: Run hundreds of concurrent instances * **Resource Management**: Automatic container lifecycle management * **Monitoring**: Live telemetry and performance metrics * **Geographic Distribution**: Global edge deployment ### Remote Examples ```bash # Basic remote execution hud run hudpython/text-2048:latest # With environment variables hud run myorg/server:v1 -- -e API_KEY=secret -e DEBUG=true # HTTP transport with custom port hud run myorg/server:v1 --transport http --port 9000 # Custom remote URL and API key hud run my-image:latest --url https://custom.mcp.server --api-key my-key # With run tracking hud run production-env:v1.0 --run-id "deploy-$(date +%Y%m%d)" ``` ### Authentication Remote execution requires authentication via API key: ```bash # Set environment variable (recommended) export HUD_API_KEY=your-api-key-here hud run my-image:latest # Or pass directly (less secure) hud run my-image:latest --api-key your-api-key-here ``` ### Run Tracking Use `--run-id` to track specific execution instances: ```bash hud run my-image:latest --run-id "experiment-001" ``` ## Local Execution Use `--local` to run the image locally with Docker: ```bash # Basic local execution hud run hudpython/text-2048:latest --local # With Docker arguments hud run myorg/server:v1 --local -- -e API_KEY=secret -v /data:/app/data # HTTP transport locally hud run myorg/server:v1 --local --transport http --port 8080 # With memory limits and volumes hud run my-env:latest --local -- --memory 2g -v $(pwd)/data:/app/data ``` Local execution is useful for: * Development and testing * Environments without internet access * Custom Docker configurations * Debugging before remote deployment ## Transport Modes ### Stdio Transport (Default) ```bash hud run my-image:latest --transport stdio ``` **Characteristics:** * Direct JSON-RPC over stdin/stdout * Single client connection * Lower latency * MCP protocol native ### HTTP Transport ```bash hud run my-image:latest --transport http --port 8765 ``` **Characteristics:** * HTTP proxy at specified port * Multiple concurrent connections * Web-compatible * Inspector support ## Docker Arguments Pass Docker arguments after `--` separator: ```bash # Environment variables hud run my-image:latest -- -e DEBUG=true -e API_KEY=secret # Volume mounts hud run my-image:latest -- -v /host/data:/container/data # Network settings hud run my-image:latest -- --network host # Memory limits hud run my-image:latest -- --memory 2g # Combined arguments hud run my-image:latest -- -e API_KEY=secret -v /data:/app/data --memory 1g ``` The `--` separator is required to distinguish HUD options from Docker arguments.
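The same separator convention carries over when you script `hud run`. A minimal sketch with `subprocess`, in the spirit of the Python scripting example in the CLI overview (the image name and Docker args here are placeholders):

```python
import subprocess

# Everything after "--" is handed to Docker unchanged
subprocess.run(
    [
        "hud", "run", "my-image:latest", "--local",
        "--",
        "-e", "DEBUG=true",
        "-v", "/host/data:/container/data",
    ],
    check=True,  # raise CalledProcessError on a non-zero exit code
)
```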
## Environment Variables | Variable | Description | Default | | ------------- | --------------------------------- | --------------------------- | | `HUD_MCP_URL` | Remote MCP server URL | `https://mcp.hud.so/v3/mcp` | | `HUD_API_KEY` | API key for remote authentication | *required for remote* | ## Comparison with Other Commands | Command | Purpose | Hot-reload | Target Use Case | | ----------- | ----------------------------- | ---------- | ----------------------------- | | `hud dev` | Development with auto-restart | ✅ Yes | Local development iteration | | `hud run` | Production execution | ❌ No | Production workloads, testing | | `hud debug` | Environment validation | ❌ No | Debugging and validation | ## Examples ### Basic Usage ```bash # Run remotely (default) hud run hudpython/text-2048:latest # Run locally for testing hud run hudpython/text-2048:latest --local # Verbose output for debugging hud run my-image:latest --local --verbose ``` ### Production Deployment ```bash # Remote execution with tracking export HUD_API_KEY=your-production-key hud run production-env:v1.0 --run-id "prod-deploy-$(date +%s)" # With environment configuration hud run production-env:v1.0 --run-id "prod-$(git rev-parse --short HEAD)" -- \ -e DATABASE_URL=postgres://... \ -e REDIS_URL=redis://... \ -e LOG_LEVEL=info ``` ### Development Testing ```bash # Test locally before remote deployment hud run my-env:dev --local -- -e DEBUG=true # Test with HTTP transport hud run my-env:dev --local --transport http --port 9000 # Test with custom data volume hud run my-env:dev --local -- -v $(pwd)/test-data:/app/data ``` ### CI/CD Integration ```yaml # GitHub Actions example - name: Run HUD environment env: HUD_API_KEY: ${{ secrets.HUD_API_KEY }} run: | # Pull latest hud pull myorg/env:latest --yes # Run tests hud run myorg/env:latest --run-id "ci-${{ github.sha }}" ``` ## Error Handling `hud run` provides clear error messages for common issues: * **Missing image**: Suggests available images or build commands * **Authentication failure**: Guides through API key setup * **Port conflicts**: Suggests alternative ports * **Docker errors**: Shows Docker-specific troubleshooting ## See Also * [`hud pull`](/reference/cli/pull) - Get environments before running * [`hud analyze`](/reference/cli/analyze) - Inspect capabilities first * [`hud debug`](/reference/cli/debug) - Test environment compliance * [`hud dev`](/reference/cli/dev) - Development with hot-reload * [Build Environments](/build-environments) - Environment development guide # Environments Source: https://docs.hud.so/reference/environments SDK reference for building MCP environments The HUD SDK provides `MCPServer` for building MCP-compatible environments that work with any MCP client. ## MCPServer ```python from hud.server import MCPServer ``` Enhanced FastMCP server with Docker-friendly features for building HUD environments. **Constructor Parameters:** | Parameter | Type | Description | Default | | ------------------ | ----- | ------------------------------- | -------- | | `name` | `str` | Server name for MCP handshake | Required | | `instructions` | `str` | Server instructions/description | `None` | | `**fastmcp_kwargs` | `Any` | Additional FastMCP parameters | - | **Key Features:** 1. **SIGTERM handling** - Graceful shutdown in containers via custom runner 2. **Initialize decorator** - Async setup during MCP initialize request 3. **Shutdown decorator** - Cleanup after lifespan context ends 4. **Enhanced add\_tool()** - Automatically handles BaseTool instances 5. 
**FastMCP inheritance** - All FastMCP methods available (`mount`, `resource`, `tool`) ### Decorators #### @initialize Run async setup during MCP initialize request: ```python mcp = MCPServer(name="my-env") @mcp.initialize async def setup_environment(ctx): """ Initialize environment resources. Args: ctx: RequestContext with: - ctx.meta: Client metadata dict - ctx.session: MCP ServerSession """ # Access metadata from agent (if provided) if ctx.meta: progress_token = ctx.meta.get("progressToken") display_width = ctx.meta.get("display_width", 1920) display_height = ctx.meta.get("display_height", 1080) # Send progress notifications if progress_token: await ctx.session.send_progress_notification( progress_token=progress_token, progress=50, total=100, message="Initializing environment..." ) ``` #### @shutdown Run cleanup after server lifespan ends: ```python @mcp.shutdown async def cleanup(): """Clean up resources on shutdown.""" if browser_provider: browser_provider.close() logger.info("Cleanup complete") ``` ### Tool Registration Three ways to register tools: ```python # 1. Decorator for simple functions @mcp.tool() async def my_tool(param: str) -> dict: return {"result": param} # 2. Add BaseTool instances from hud.tools import BashTool bash = BashTool() mcp.add_tool(bash) # Automatically uses bash.mcp internally # 3. Add non-BaseTool instances directly from custom import PlaywrightTool playwright = PlaywrightTool() mcp.add_tool(playwright) # Added as-is ``` ### Hub Pattern (mount) Use BaseHub for organized tool namespaces: ```python from hud.tools import BaseHub # Create hub setup_hub = BaseHub("setup") # Add internal tools (hidden from agents) @setup_hub.tool("board") async def setup_board(size: int = 4): game = setup_hub.env game.reset(size=size) return [TextContent(text=f"{size}x{size} board initialized")] # Mount hub on server mcp.mount(setup_hub) # Agents call via dispatcher: setup(name="board", arguments={"size": 4}) ``` ### Resources Expose metadata via MCP resources: ```python @mcp.resource("telemetry://live") async def get_telemetry(): """Expose live telemetry data.""" return { "provider": os.getenv("BROWSER_PROVIDER"), "status": "running" if browser_provider else "stopped", "live_url": browser_provider.get_live_view_url() if browser_provider else None, "timestamp": datetime.now().isoformat() } ``` ### Running the Server ```python if __name__ == "__main__": # Run with SIGTERM handling (stdio by default) mcp.run() # Or specify transport mcp.run(transport="sse", port=8080) ``` ## Real Environment Examples ### Minimal Environment ```python # src/hud_controller/server.py from hud.server import MCPServer from mcp.types import TextContent mcp = MCPServer(name="counter-env") counter = {"value": 0} @mcp.tool() async def setup(start_value: int = 0): """Initialize counter.""" counter["value"] = start_value return {"status": "ready", "counter": counter["value"]} @mcp.tool() async def increment(): """Increment counter.""" counter["value"] += 1 return [TextContent(text=f"Counter: {counter['value']}", type="text")] @mcp.tool() async def evaluate(target: int): """Check if target reached.""" from hud.tools.types import EvaluationResult return EvaluationResult( reward=1.0 if counter["value"] >= target else 0.0, done=counter["value"] >= target ) if __name__ == "__main__": mcp.run() ``` ### text\_2048 Environment From `environments/text_2048/src/hud_controller/server.py`: ```python from hud.server import MCPServer from .game import Game2048 from .tools import MoveTool from .setup import setup as 
setup_hub from .evaluate import evaluate as evaluate_hub mcp = MCPServer(name="text-2048") game = None @mcp.initialize async def initialize_environment(ctx): global game # Progress notifications progress_token = getattr(ctx.meta, "progressToken", None) if ctx.meta else None async def send_progress(progress: int, message: str): if progress_token: await ctx.session.send_progress_notification( progress_token=progress_token, progress=progress, total=100, message=message ) await send_progress(0, "Starting 2048 game environment...") # Create game game = Game2048() game.reset() await send_progress(50, "Setting up game board...") # Set game on hubs setup_hub.env = game evaluate_hub.env = game # Mount hubs mcp.mount(setup_hub) mcp.mount(evaluate_hub) await send_progress(70, "Configuring tools...") # Add move tool mcp.add_tool(MoveTool(env=game)) await send_progress(100, "2048 environment ready") ``` ### remote\_browser Environment From `environments/remote_browser/src/hud_controller/server.py`: ```python from hud.server import MCPServer from hud.tools.computer import HudComputerTool, AnthropicComputerTool, OpenAIComputerTool from .tools import PlaywrightToolWithMemory, BrowserExecutor from .setup import setup as setup_hub from .evaluate import evaluate as evaluate_hub from .providers import get_provider mcp = MCPServer( name="HUD Remote Browser Environment", instructions="""Remote browser automation environment...""" ) # Global state browser_provider = None playwright_tool = None @mcp.resource("telemetry://live") async def get_telemetry_resource(): """MCP resource with live browser status.""" return { "provider": os.getenv("BROWSER_PROVIDER", "unknown"), "status": "running" if browser_provider else "stopped", "live_url": browser_provider.get_live_view_url() if browser_provider else None, "cdp_url": browser_provider.cdp_url if browser_provider else None } @mcp.initialize async def initialize_environment(ctx): global browser_provider, playwright_tool # Get metadata metadata = ctx.meta progress_token = metadata.get("progressToken", None) # Initialize provider provider_name = os.getenv("BROWSER_PROVIDER") provider_class = get_provider(provider_name) browser_provider = provider_class(config) # Launch browser cdp_url = await browser_provider.launch() # Create playwright tool playwright_tool = PlaywrightToolWithMemory(cdp_url=cdp_url) await playwright_tool._ensure_browser() # Add playwright tool (not a BaseTool, added directly) mcp.add_tool(playwright_tool) # Create computer tools executor = BrowserExecutor(playwright_tool) tool_kwargs = {"executor": executor} # Add display dimensions from metadata if metadata: width = metadata.get("display_width") height = metadata.get("display_height") if width and height: tool_kwargs["width"] = width tool_kwargs["height"] = height # Add computer tools (all are BaseTool subclasses) mcp.add_tool(HudComputerTool(**tool_kwargs)) mcp.add_tool(AnthropicComputerTool(**tool_kwargs)) mcp.add_tool(OpenAIComputerTool(**tool_kwargs)) # Mount hubs setup_hub.env = playwright_tool evaluate_hub.env = playwright_tool mcp.mount(setup_hub) mcp.mount(evaluate_hub) @mcp.shutdown async def shutdown_environment(): """Cleanup browser resources.""" global browser_provider if browser_provider: browser_provider.close() browser_provider = None ``` ## Standard Structure ### Directory Layout ``` my-environment/ ├── Dockerfile ├── pyproject.toml ├── README.md └── src/ └── hud_controller/ # Standard package name ├── __init__.py ├── server.py # Entry point with MCPServer ├── context.py # Environment 
state (optional) ├── setup/ # Setup hub (optional) │ ├── __init__.py # Creates BaseHub("setup") │ └── *.py # Setup functions ├── evaluate/ # Evaluate hub (optional) │ ├── __init__.py # Creates BaseHub("evaluate") │ └── *.py # Evaluator functions └── tools/ # Custom tools (optional) └── *.py # Tool implementations ``` ### Dockerfile ```dockerfile FROM python:3.11-slim WORKDIR /app # Install dependencies COPY pyproject.toml ./ RUN pip install -e . # Copy source COPY src/ ./src/ # Critical: stdio transport for MCP CMD ["python", "-m", "hud_controller.server"] ``` ### Hub Module Pattern Example from text\_2048: ```python # src/hud_controller/setup/__init__.py from hud.tools.base import BaseHub setup = BaseHub("setup") # Import all setup functions to register them from . import board __all__ = ["setup"] # src/hud_controller/setup/board.py from . import setup @setup.tool("board") async def setup_board(board_size: int = 4): """Initialize game board.""" game = setup.env # Access environment from hub game.reset(size=board_size) return [TextContent(text=f"{board_size}x{board_size} game initialized")] ``` ## Key Concepts ### Environment State Three patterns for managing state: 1. **Global variables** (simple environments): ```python game = None @mcp.initialize async def initialize_environment(ctx): global game game = Game2048() ``` 2. **Context class** (complex environments): ```python class EnvironmentContext: def __init__(self): self.browser = None self.page = None env = EnvironmentContext() ``` 3. **Hub env attribute** (for tool access): ```python setup_hub.env = game # Tools access via hub.env ``` ### Tool Lifecycle 1. **Setup tools** - Hidden from agents, prepare environment state 2. **Interaction tools** - Available to agents for control 3. **Evaluate tools** - Hidden from agents, score performance ### Progress Notifications Send [progress updates](https://modelcontextprotocol.io/specification/basic/utilities/progress) during long-running operations: ```python async def send_progress(progress: int, message: str): if progress_token: await ctx.session.send_progress_notification( progress_token=progress_token, progress=progress, total=100, message=message ) ``` Progress notifications follow the [MCP progress specification](https://modelcontextprotocol.io/specification/basic/utilities/progress#progress-flow). The `progressToken` comes from the client's request [metadata](https://modelcontextprotocol.io/specification/basic/index#_meta). 
### Metadata Access Agent metadata flows through initialization: ```python @mcp.initialize async def initialize_environment(ctx): # From agent's metadata class variable width = ctx.meta.get("display_width", 1920) if ctx.meta else 1920 height = ctx.meta.get("display_height", 1080) if ctx.meta else 1080 ``` ## Testing ```bash # CLI testing hud debug my-env:latest hud analyze my-env:latest # Python testing async def test(): from hud.clients import MCPClient client = MCPClient({ "env": {"command": "docker", "args": ["run", "-i", "my-env"]} }) async with client: tools = await client.list_tools() result = await client.call_tool("setup", {"value": 0}) ``` ## See Also * [Build Environments](/build-environments) - Conceptual guide * [Tools](/reference/tools) - Tool implementation reference * [Advanced Design](/build-environments/registry) - Hub pattern details # Tasks Source: https://docs.hud.so/reference/tasks SDK reference for task configuration and dataset utilities The HUD SDK provides the `Task` class for defining agent objectives and dataset utilities for managing task collections. ## Task Class ```python from hud.datasets import Task ``` Pydantic model that defines an agent's objective, setup, and evaluation criteria. **Fields:** | Field | Type | Description | Default | | --------------- | ------------------------------------------ | ------------------------------------ | -------- | | `id` | `str \| None` | Unique identifier (UUID recommended) | `None` | | `prompt` | `str` | Task instruction for the agent | Required | | `mcp_config` | `dict[str, Any]` | MCP server configuration | Required | | `setup_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to prepare environment | `None` | | `evaluate_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to score performance | `None` | | `system_prompt` | `str \| None` | Additional system prompt | `None` | | `metadata` | `dict[str, Any]` | Extra task metadata | `{}` | ### Environment Variable Substitution The `mcp_config` field automatically resolves environment variables using `${VAR_NAME}` syntax: ```python task = Task( prompt="Navigate to the dashboard", mcp_config={ "browser": { "url": "${HUD_MCP_URL:https://mcp.hud.so/v3/mcp}", "headers": { "Authorization": "Bearer ${HUD_API_KEY}", "Mcp-Image": "hudpython/hud-browser:latest" } } } ) ``` **Substitution happens automatically when Task is created from a dict** - environment variables are resolved using Python's `Template.substitute()` with a defaultdict that returns empty strings for missing variables. ### Field Validators Task automatically: 1. **Parses JSON strings** - `mcp_config` and `metadata` can be JSON strings 2. **Converts dicts to MCPToolCall** - `setup_tool` and `evaluate_tool` dicts are converted 3. 
**Resolves environment variables** - Only when created from dict (preserves templates in model\_dump()) ```python # All of these work: task = Task( prompt="Test", mcp_config='{"server": {"url": "..."}}', # JSON string → dict setup_tool='{"name": "setup"}' # JSON string → MCPToolCall ) task = Task( prompt="Test", mcp_config={...}, setup_tool={"name": "setup", "arguments": {...}} # dict → MCPToolCall ) task = Task( prompt="Test", mcp_config={...}, evaluate_tool=[ # List of tool calls {"name": "check_text", "arguments": {"text": "Success"}}, {"name": "check_url", "arguments": {"url": "example.com"}} ] ) ``` ## Dataset Functions ### run\_dataset ```python from hud.datasets import run_dataset results = await run_dataset( name="My Evaluation", dataset="hud-evals/SheetBench-50", # or Dataset object or list[dict] agent_class=ClaudeAgent, agent_config={"model": "claude-3-5-sonnet"}, max_concurrent=50, max_steps=40 ) ``` Run all tasks in a dataset with automatic parallelization and job tracking. **Parameters:** | Parameter | Type | Description | Default | | ---------------------- | ------------------------------ | -------------------------------------------- | --------- | | `name` | `str` | Job name for tracking | Required | | `dataset` | `str \| Dataset \| list[dict]` | HF dataset ID, Dataset object, or task dicts | Required | | `agent_class` | `type[MCPAgent]` | Agent class to instantiate | Required | | `agent_config` | `dict[str, Any] \| None` | Constructor kwargs for agent | `None` | | `max_concurrent` | `int` | Maximum parallel tasks | `50` | | `metadata` | `dict[str, Any] \| None` | Job metadata | `None` | | `max_steps` | `int` | Max steps per task | `40` | | `split` | `str` | Dataset split when loading by ID | `"train"` | | `auto_respond` | `bool` | Use ResponseAgent for continuations | `False` | | `custom_system_prompt` | `str \| None` | Override dataset system prompt | `None` | **Returns:** `list[Trace]` - Results in dataset order **Features:** * Automatic job tracking with `hud.job()` * System prompt loading from dataset's `system_prompt.txt` * Parallel execution with semaphore control * Progress tracking with individual traces * Graceful error handling (continues on failures) **Example:** ```python # From HuggingFace dataset ID results = await run_dataset( "SheetBench Evaluation", "hud-evals/SheetBench-50", ClaudeAgent, {"model": "claude-3-5-sonnet-20241022"} ) # From Dataset object from datasets import load_dataset dataset = load_dataset("hud-evals/OSWorld-Verified", split="train") results = await run_dataset("OSWorld Test", dataset, OperatorAgent) # From list of dicts tasks = [ { "prompt": "Click submit", "mcp_config": {...}, "setup_tool": {"name": "setup"} } ] results = await run_dataset("Custom", tasks, ClaudeAgent) ``` ### fetch\_system\_prompt\_from\_dataset ```python from hud.datasets import fetch_system_prompt_from_dataset system_prompt = await fetch_system_prompt_from_dataset("hud-evals/SheetBench-50") ``` Fetch `system_prompt.txt` from a HuggingFace dataset repository. **Returns:** `str | None` - System prompt text if found **Note:** Requires `huggingface_hub` to be installed. ### save\_tasks ```python from hud.datasets import save_tasks # IMPORTANT: Pass dicts, not Task objects! task_dicts = [task.model_dump() for task in tasks] save_tasks( tasks=task_dicts, repo_id="my-org/my-tasks", private=False, dataset_name="My Tasks v1.0" ) ``` Save tasks to HuggingFace dataset with JSON string serialization. 
**Parameters:** | Parameter | Type | Description | Default | | ---------- | ---------------------- | ------------------------------------ | -------- | | `tasks` | `list[dict[str, Any]]` | Task dictionaries (NOT Task objects) | Required | | `repo_id` | `str` | HuggingFace repository ID | Required | | `**kwargs` | `Any` | Additional args for `push_to_hub()` | - | **Important:** Always pass dictionaries to preserve environment variable templates: ```python # ✅ GOOD - Preserves ${HUD_API_KEY} in the dataset task_dicts = [task.model_dump() for task in tasks] save_tasks(task_dicts, "my-dataset") # ❌ BAD - Would expose resolved API keys! # save_tasks(tasks, "my-dataset") # Don't do this! ``` **Dataset Schema:** The function converts complex fields to JSON strings for clean HuggingFace schema: * `mcp_config` → JSON string * `setup_tool` → JSON string (if present) * `evaluate_tool` → JSON string (if present) * `metadata` → JSON string (if present) ## MCPToolCall Type ```python from hud.types import MCPToolCall ``` Defines a tool invocation with arguments. **Fields:** | Field | Type | Description | Default | | ----------- | ---------------- | ----------------- | -------- | | `name` | `str` | Tool name to call | Required | | `arguments` | `dict[str, Any]` | Tool arguments | `{}` | ## Real-World Examples ### Loading Tasks from Datasets From `examples/run_evaluation.py`: ```python # Load dataset dataset = load_dataset("hud-evals/SheetBench-50", split="train") # Get system prompt dataset_system_prompt = await fetch_system_prompt_from_dataset("hud-evals/SheetBench-50") # Run single task sample_task = dataset[0] task = Task(**sample_task) # Auto-converts dict to Task # Or run all tasks results = await run_dataset( "Full SheetBench Evaluation", dataset, ClaudeAgent, {"model": "claude-3-5-sonnet-20241022"}, max_concurrent=50 ) ``` ### Task Structure in Datasets From `environments/text_2048/2048_taskconfigs.json`: ```json { "id": "2048_target_256", "prompt": "You are playing 2048 and your goal is to reach the 256 tile.", "mcp_config": { "local": { "command": "docker", "args": ["run", "--rm", "-i", "hud-text-2048"] } }, "setup_tool": { "name": "setup", "arguments": { "name": "board", "arguments": {"board_size": 4} } }, "evaluate_tool": { "name": "evaluate", "arguments": { "name": "max_number", "arguments": {"target": 256} } }, "metadata": { "difficulty": 256, "game": "2048" } } ``` ### Creating and Saving Tasks ```python import os from uuid import uuid4 # Set environment variables os.environ["HUD_API_KEY"] = "sk-hud-..." 
os.environ["BROWSER_PROVIDER"] = "anchorbrowser" # Create tasks with env var templates tasks = [] for url in ["example.com", "test.com"]: task = Task( id=str(uuid4()), prompt=f"Navigate to {url} and find the login button", mcp_config={ "browser": { "url": "${HUD_MCP_URL:https://mcp.hud.so/v3/mcp}", "headers": { "Authorization": "Bearer ${HUD_API_KEY}", "X-Provider": "${BROWSER_PROVIDER}" } } }, setup_tool=MCPToolCall( name="setup", arguments={"name": "navigate", "arguments": {"url": url}} ), evaluate_tool=MCPToolCall( name="evaluate", arguments={"name": "element_exists", "arguments": {"selector": "button.login"}} ), metadata={ "category": "navigation", "difficulty": "easy" } ) tasks.append(task) # Save to HuggingFace (preserves env var templates) task_dicts = [t.model_dump() for t in tasks] save_tasks(task_dicts, "my-org/navigation-tasks") ``` ### Agent Integration Tasks automatically configure agents: ```python task = Task( prompt="Complete the form", mcp_config={...}, setup_tool={...}, evaluate_tool={...}, system_prompt="Be very careful with form fields" ) agent = ClaudeAgent() # When agent.run(task) is called: # 1. If no mcp_client, creates one from task.mcp_config # 2. Adds task.system_prompt to agent's system prompt # 3. Adds lifecycle tools to agent.lifecycle_tools # 4. Runs setup_tool (hidden from agent) # 5. Executes task.prompt # 6. Runs evaluate_tool (hidden from agent) # 7. Returns Trace with evaluation reward ``` ## Best Practices 1. **Use UUIDs for task IDs** - Required for HuggingFace datasets 2. **Save dictionaries, not objects** - Preserves env var templates 3. **Include system\_prompt.txt** - Upload to dataset repo root 4. **Use metadata for filtering** - Category, difficulty, tags 5. **Test locally first** - Before uploading to HuggingFace 6. **Version your datasets** - Use meaningful repo names ## Common Patterns ### Filtering Tasks ```python # Load all tasks dataset = load_dataset("hud-evals/WebArena-Lite", split="train") # Filter in Python login_tasks = [ task for task in dataset if "login" in task["prompt"].lower() ] # Convert and run results = await run_dataset( "Login Tasks Only", login_tasks, ClaudeAgent ) ``` ### Custom System Prompts ```python # Override dataset system prompt results = await run_dataset( "Custom Evaluation", "hud-evals/SheetBench-50", ClaudeAgent, custom_system_prompt="You are an expert spreadsheet user. Be precise." ) ``` ### Environment Variable Management ```python # Tasks preserve templates task = Task( prompt="Test", mcp_config={ "server": { "api_key": "${API_KEY:default-key}" } } ) # model_dump() preserves the template data = task.model_dump() assert data["mcp_config"]["server"]["api_key"] == "${API_KEY:default-key}" # But task object has resolved value os.environ["API_KEY"] = "real-key" task2 = Task(**data) # task2.mcp_config["server"]["api_key"] is now "real-key" ``` ## See Also * [Task System](/core-concepts/task-system) - Conceptual overview * [Create Benchmarks](/evaluate-agents/create-benchmarks) - Building datasets * [Agents](/reference/agents) - How agents use tasks # Tools Source: https://docs.hud.so/reference/tools SDK reference for HUD tools and executors The HUD SDK provides a comprehensive set of tools for environment interaction. All tools inherit from `BaseTool` and return standardized MCP content blocks. ## Base Classes ### BaseTool ```python from hud.tools import BaseTool ``` Abstract base class for all MCP tools. Provides standardized output formatting and MCP registration. 
**Constructor Parameters:** | Parameter | Type | Description | Default | | ------------- | ----- | --------------------------------------------------- | ------------------- | | `env` | `Any` | Optional stateful context (game, executor, browser) | `None` | | `name` | `str` | Tool name for MCP registration | Auto from class | | `title` | `str` | Human-readable display name | Auto from class | | `description` | `str` | Tool description | Auto from docstring | **Abstract Methods:** ```python async def __call__(self, **kwargs: Any) -> list[ContentBlock] ``` **Properties:** * `mcp` - FastMCP FunctionTool wrapper for registration **Usage:** ```python class MyTool(BaseTool): async def __call__(self, param: str) -> list[ContentBlock]: result = await self.env.do_something(param) return [TextContent(text=result, type="text")] # MCPServer automatically handles BaseTool instances tool = MyTool(env=my_context) mcp_server.add_tool(tool) # No need for .mcp property ``` ### BaseHub ```python from hud.tools import BaseHub ``` A composition-friendly FastMCP server for organizing tools into nested namespaces. **Key Features:** * Internal tool dispatcher pattern * Automatic resource catalog generation * Hidden internal tools from MCP clients **Usage:** ```python hub = BaseHub("evaluators") @hub.tool() # Internal tool, hidden from agents async def url_match(url: str) -> EvaluationResult: return EvaluationResult(reward=1.0, done=True) # Access via dispatcher when mounted on server # Agents call: evaluators(name="url_match", arguments={"url": "..."}) ``` ## Core Tools ### BashTool ```python from hud.tools import BashTool ``` Execute bash commands in a persistent shell session. **Constructor Parameters:** | Parameter | Type | Description | Default | | --------- | -------------- | --------------------------- | ------- | | `session` | `_BashSession` | Pre-configured bash session | `None` | **Tool Parameters:** | Parameter | Type | Description | Default | | --------- | ------ | ------------------------ | -------- | | `command` | `str` | Bash command to execute | Required | | `restart` | `bool` | Restart the bash session | `False` | **Example:** ```python bash = BashTool() result = await bash(command="ls -la") # Returns: [TextContent(text="file1.txt\nfile2.txt", type="text")] # Restart session await bash(restart=True) ``` ### EditTool ```python from hud.tools import EditTool ``` File system editor with undo support. **Constructor Parameters:** | Parameter | Type | Description | Default | | -------------- | ----------------------- | --------------------- | ------- | | `file_history` | `dict[Path, list[str]]` | Edit history per file | `None` | **Commands:** | Command | Parameters | Description | | ------------- | -------------------------------- | ------------------------------- | | `view` | `path`, `view_range` | View file or directory contents | | `create` | `path`, `file_text` | Create new file | | `str_replace` | `path`, `old_str`, `new_str` | Replace string in file | | `insert` | `path`, `insert_line`, `new_str` | Insert text at line | | `undo_edit` | `path` | Undo last edit | **Example:** ```python editor = EditTool() # View file await editor(command="view", path="/path/to/file.py", view_range=[1, 20]) # Create file await editor(command="create", path="/new/file.py", file_text="print('hello')") # Replace text await editor(command="str_replace", path="/file.py", old_str="old_text", new_str="new_text") ``` ### PlaywrightTool ```python from hud.tools import PlaywrightTool ``` Web automation using Playwright browser. 
**Constructor Parameters:** | Parameter | Type | Description | Default | | --------- | ------ | ---------------------------- | ------- | | `page` | `Page` | Existing Playwright page | `None` | | `cdp_url` | `str` | Chrome DevTools Protocol URL | `None` | **Actions:** | Action | Parameters | Description | | ------------------ | ---------------------------- | -------------------------- | | `navigate` | `url`, `wait_for_load_state` | Navigate to URL | | `screenshot` | - | Capture page screenshot | | `click` | `selector` | Click element | | `type` | `selector`, `text` | Type text in element | | `get_page_info` | - | Get page title and URL | | `wait_for_element` | `selector` | Wait for element to appear | **Example:** ```python tool = PlaywrightTool() # Navigate await tool(action="navigate", url="https://example.com") # Click button await tool(action="click", selector="button.submit") # Type text await tool(action="type", selector="input#search", text="query") ``` ## Computer Control Tools ### HudComputerTool ```python from hud.tools import HudComputerTool ``` Universal computer control with automatic scaling and executor selection. **Constructor Parameters:** | Parameter | Type | Description | Default | | ---------------- | -------------------------------- | ------------------- | ----------- | | `executor` | `BaseExecutor` | Executor to use | Auto-detect | | `platform_type` | `"auto"`, `"xdo"`, `"pyautogui"` | Executor type | `"auto"` | | `display_num` | `int` | X display number | From env | | `width` | `int` | Agent screen width | 1280 | | `height` | `int` | Agent screen height | 720 | | `rescale_images` | `bool` | Rescale screenshots | `True` | **Actions:** | Action | Parameters | Description | | ------------ | ------------------------------------------ | --------------------- | | `screenshot` | - | Capture screen | | `click` | `x`, `y`, `button`, `pattern`, `hold_keys` | Mouse click | | `write` | `text`, `enter_after`, `delay` | Type text | | `press` | `keys` | Press key combination | | `scroll` | `x`, `y`, `scroll_x`, `scroll_y` | Scroll | | `drag` | `start_x`, `start_y`, `end_x`, `end_y` | Drag mouse | **Example:** ```python computer = HudComputerTool() # Take screenshot await computer(action="screenshot") # Click at coordinates await computer(action="click", x=100, y=200) # Type text await computer(action="write", text="Hello World", enter_after=True) # Press hotkey await computer(action="press", keys=["ctrl", "c"]) ``` ### AnthropicComputerTool ```python from hud.tools import AnthropicComputerTool ``` Computer control optimized for Anthropic's Claude models. **Features:** * Pre-configured for 1280x720 resolution * Optimized action names for Claude * Built-in screenshot scaling ### OpenAIComputerTool ```python from hud.tools import OpenAIComputerTool ``` Computer control optimized for OpenAI models. **Features:** * Pre-configured for 1920x1080 resolution * Simplified action interface * No automatic screenshot scaling ## Executors Executors provide platform-specific implementations for computer control actions. ### BaseExecutor ```python from hud.tools.executors import BaseExecutor ``` Abstract base providing simulation mode for all actions. 
**Core Methods:** * `click(x, y, button, pattern, hold_keys)` - Mouse click * `write(text, enter_after, delay)` - Type text * `press(keys)` - Press key combination * `scroll(x, y, scroll_x, scroll_y)` - Scroll * `drag(start_x, start_y, end_x, end_y)` - Drag mouse * `screenshot()` - Capture screen * `get_screen_size()` - Get display dimensions ### PyAutoGUIExecutor ```python from hud.tools.executors import PyAutoGUIExecutor ``` Cross-platform executor using PyAutoGUI library. **Features:** * Works on Windows, macOS, Linux * Real mouse/keyboard control * Screenshot capture * Automatic failsafe **Example:** ```python executor = PyAutoGUIExecutor() computer = HudComputerTool(executor=executor) ``` ### XDOExecutor ```python from hud.tools.executors import XDOExecutor ``` Linux/X11 executor using xdotool. **Features:** * Native X11 integration * Faster than PyAutoGUI on Linux * Support for X display selection * Window management capabilities **Example:** ```python executor = XDOExecutor(display_num=1) computer = HudComputerTool(executor=executor) ``` ## Common Types ### ContentBlock MCP standard output format (from `mcp.types`): ```python from mcp.types import TextContent, ImageContent # Text output TextContent(text="Operation complete", type="text") # Image output ImageContent(data="base64_data", mimeType="image/png", type="image") ``` ### EvaluationResult ```python from hud.tools.types import EvaluationResult result = EvaluationResult( reward=0.8, # Score 0-1 done=True, # Task complete content="Details", # Optional text info={"score": 80} # Metadata ) ``` ### ContentResult ```python from hud.tools.types import ContentResult # Helper for building complex outputs result = ContentResult( output="Success message", error="Error if any", base64_image="screenshot_data", system="System message" ) # Convert to ContentBlocks blocks = result.to_content_blocks() ``` ## Integration Examples ### Adding Tools to Environment ```python from hud.server import MCPServer from hud.tools import BashTool, EditTool mcp = MCPServer(name="my-env") # MCPServer handles BaseTool instances automatically bash = BashTool() mcp.add_tool(bash) # Internally uses bash.mcp editor = EditTool() mcp.add_tool(editor) # Same here ``` ### Custom Tool Implementation ```python from hud.tools import BaseTool from mcp.types import TextContent class DatabaseTool(BaseTool): def __init__(self, db_connection): super().__init__( env=db_connection, name="database", title="Database Query Tool", description="Execute SQL queries" ) async def __call__(self, query: str) -> list[ContentBlock]: try: results = await self.env.execute(query) return [TextContent(text=str(results), type="text")] except Exception as e: return [TextContent(text=f"Error: {e}", type="text")] ``` ### Hub Pattern for Evaluators ```python from hud.tools import BaseHub from hud.tools.types import EvaluationResult evaluators = BaseHub("evaluate") @evaluators.tool("text_contains") async def check_text(text: str, target: str) -> EvaluationResult: return EvaluationResult( reward=1.0 if target in text else 0.0, done=True, content=f"Checking if '{target}' in text" ) # Use in environment @mcp.tool() async def evaluate(name: str, **kwargs): return await evaluators.call_tool(name, kwargs) ``` ## See Also * [Build Environments](/build-environments) - Using tools in environments * [Architecture](/core-concepts/architecture) - How tools fit in HUD * [Adapting Software](/build-environments/adapting-software) - Tool patterns # Datasets Source: https://docs.hud.so/train-agents/datasets Create and 
### Hub Pattern for Evaluators

```python
from hud.tools import BaseHub
from hud.tools.types import EvaluationResult

evaluators = BaseHub("evaluate")

@evaluators.tool("text_contains")
async def check_text(text: str, target: str) -> EvaluationResult:
    return EvaluationResult(
        reward=1.0 if target in text else 0.0,
        done=True,
        content=f"Checking if '{target}' in text"
    )

# Use in environment
@mcp.tool()
async def evaluate(name: str, **kwargs):
    return await evaluators.call_tool(name, kwargs)
```

## See Also

* [Build Environments](/build-environments) - Using tools in environments
* [Architecture](/core-concepts/architecture) - How tools fit in HUD
* [Adapting Software](/build-environments/adapting-software) - Tool patterns

# Datasets

Source: https://docs.hud.so/train-agents/datasets

Create and manage training datasets

Training datasets are collections of HUD tasks stored on HuggingFace. Each task defines a prompt, MCP configuration, and evaluation criteria.

## Creating Datasets

```python
from hud.datasets import Task, save_tasks
from hud.types import MCPToolCall

# Create task variations
tasks = []
for i in range(100):
    tasks.append(Task(
        prompt=f"Reach the {2**(9+i%4)} tile in 2048",
        mcp_config={"hudpython/hud-text-2048:v1.2": {}},
        setup_tool="setup_board",
        evaluate_tool=MCPToolCall(
            name="evaluate_max_tile",
            arguments={"target": 2**(9+i%4)}
        )
    ))

# Save to HuggingFace
save_tasks(tasks, "my-org/2048-training")
```

## Dataset Format

Tasks are stored as HuggingFace datasets with these fields:

```python
{
    "id": "task-001",
    "prompt": "Navigate to login page",
    "mcp_config": '{"hudpython/hud-browser:latest": {}}',  # JSON string
    "setup_tool": '{"name": "clear_cookies"}',             # JSON string
    "evaluate_tool": '{"name": "check_logged_in"}',        # JSON string
    "metadata": '{"difficulty": "easy"}'                   # JSON string
}
```

## Loading Datasets

```python
from datasets import load_dataset
from hud.datasets import Task

# Load from HuggingFace
dataset = load_dataset("hud-evals/2048-taskset", split="train")
tasks = [Task(**task_dict) for task_dict in dataset]

# Filter tasks
easy_tasks = [t for t in tasks if t.metadata.get("difficulty") == "easy"]

# Use in training
env = vf.load_environment(
    env_id="hud-vf-gym",
    taskset="hud-evals/2048-taskset",
    config_path="./configs/2048.yaml",
    num_tasks=50  # Sample 50 tasks
)
```

## Curriculum Learning

Create staged datasets for progressive training:

```python
# Stage 1: Simple moves
save_tasks(create_simple_tasks(), "my-org/2048-stage1")

# Stage 2: Strategic play
save_tasks(create_medium_tasks(), "my-org/2048-stage2")

# Stage 3: Expert level
save_tasks(create_hard_tasks(), "my-org/2048-stage3")
```

## Best Practices

* **Diversity**: Include varied prompts and scenarios
* **Balance**: Mix difficulty levels appropriately
* **Size**: Typically 100-1000 tasks per dataset
* **Validation**: Keep 10-20% for evaluation (see the sketch below)
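A simple way to honor the validation split is to shuffle once and carve off a holdout before saving (a minimal sketch, assuming `tasks` is the list built above; the dataset names are placeholders):

```python
import random

from hud.datasets import save_tasks

random.seed(42)
random.shuffle(tasks)

split = int(len(tasks) * 0.85)  # hold out ~15% for evaluation
save_tasks(tasks[:split], "my-org/2048-train")
save_tasks(tasks[split:], "my-org/2048-eval")
```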
## Example Datasets

* [hud-evals/2048-taskset](https://huggingface.co/datasets/hud-evals/2048-taskset) - 2048 game variations
* [hud-evals/sheetbench-50](https://huggingface.co/datasets/hud-evals/sheetbench-50) - Spreadsheet tasks

# Train Agents

Source: https://docs.hud.so/train-agents/index

Improve agent performance with reinforcement learning

Train agents to perform better on specific tasks using reinforcement learning. Small models trained on your tasks can outperform larger general models.

## Overview

HUD integrates with [Verifiers](https://github.com/willccbb/verifiers) for GRPO (Group Relative Policy Optimization) training:

```python
import verifiers as vf

# Load environment
env = vf.load_environment(
    env_id="hud-vf-gym",
    taskset="hud-evals/2048-taskset",
    config_path="configs/2048.yaml"
)

# Train with GRPO
model, tokenizer = vf.get_model_and_tokenizer("Qwen/Qwen2.5-3B-Instruct")
trainer = vf.GRPOTrainer(
    model=model,
    env=env,
    args=vf.grpo_defaults(),
    peft_config=vf.lora_defaults()  # LoRA for efficiency
)
trainer.train()
```

## Why Train?

* **Task Performance**: A 3B model can reach 80% success vs a 35% baseline
* **Cost**: Smaller models are cheaper to run
* **Speed**: Trained models complete tasks faster
* **Consistency**: Less variance in behavior

## Requirements

| Model Size | GPU Memory         | Training Time (100 steps) |
| ---------- | ------------------ | ------------------------- |
| 3B         | 24GB (1x RTX 4090) | 30 minutes                |
| 7B         | 40GB (1x A100)     | 1 hour                    |
| 70B        | 320GB (4x A100)    | 6+ hours                  |
## Monitoring & Analytics

Track training progress in real-time:

* **[app.hud.so](https://app.hud.so)** - View traces, metrics, and rewards for each training generation
* **Weights & Biases** - Detailed ML metrics and loss curves
* **Local logs** - Training checkpoints and evaluation results

## Getting Started

* Train your first agent in 30 minutes
* Configure tools with YAML
* Create training datasets

# RL Quickstart

Source: https://docs.hud.so/train-agents/rl-quickstart

Train agents with reinforcement learning using GRPO

Train agents to perform better on specific tasks using GRPO (Group Relative Policy Optimization).

## Quick Start

### Prerequisites

* GPU with 24GB+ VRAM (A100, RTX 4090)
* Python 3.12+
* HUD API key (set `HUD_API_KEY` environment variable)

### Installation

```bash
# Clone repository
git clone https://github.com/hud-evals/hud-python
cd hud-python/rl

# Clone hud-vf-gym (required)
git clone https://github.com/hud-evals/hud-vf-gym.git

# Install with uv (recommended)
uv sync
source .venv/bin/activate
```

### Train 2048 Agent

```bash
# Set API key
export HUD_API_KEY="YOUR_API_KEY"

# Build the 2048 environment
cd ../environments/text_2048/
docker build -t hud-text-2048 .
cd ../../rl/
```

```bash
# Terminal 1 - GPU 0
CUDA_VISIBLE_DEVICES=0 vf-vllm \
    --model Qwen/Qwen2.5-3B-Instruct \
    --enforce-eager \
    --disable-log-requests
```

```bash
# Terminal 2 - GPU 1
CUDA_VISIBLE_DEVICES=1 python train_2048.py
```

The included `train_2048.py` script trains a 3B model on the 2048 game. Training takes ~30 minutes for 100 steps.

![RL Training Progress](https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/src/images/rl_2.png)

*Qwen-2.5-3B agent training on the [text-2048](https://github.com/hud-evals/hud-python/tree/main/environments/text_2048) environment using GRPO*

## How It Works

GRPO trains by comparing multiple attempts on the same task:

```mermaid
graph LR
    A[Task] --> B[Generate 16 Attempts]
    B --> C[Rank by Reward]
    C --> D[Learn from Best]
    D --> E[Improved Model]
```
## Configuration

Training uses YAML configs to map agent tools to MCP tools:

```yaml
# From rl/configs/2048.yaml
system_prompt: |
  You are an excellent 2048 player.
  Available moves: left(), right(), up(), down(), done()
  Return ONLY: move()

action_mappings:
  left:
    _tool: "move"
    direction: {static: "left"}
  right:
    _tool: "move"
    direction: {static: "right"}
  # ... up, down similar
```

## Custom Training

```python
import verifiers as vf

# Load environment
env = vf.load_environment(
    env_id="hud-vf-gym",
    taskset="your-org/your-taskset",      # HuggingFace dataset
    config_path="configs/your-env.yaml",  # In hud-vf-gym
    num_tasks=10
)

# Configure training (matching train_2048.py)
model, tokenizer = vf.get_model_and_tokenizer("Qwen/Qwen2.5-3B-Instruct")
args = vf.grpo_defaults(run_name="my-training")
args.gradient_accumulation_steps = 2
args.per_device_train_batch_size = 8
args.num_generations = 16
args.max_steps = 100
args.learning_rate = 1e-6

# Train with LoRA
trainer = vf.GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    env=env,
    args=args,
    peft_config=vf.lora_defaults()
)
trainer.train()
```

## Monitoring

### HUD Dashboard

View training progress and traces at [app.hud.so](https://app.hud.so):

* Real-time screenshot and step viewing
* Tool call traces for each trajectory
* Reward distribution across generations

### Weights & Biases

For detailed ML metrics:

```python
args.report_to = "wandb"
os.environ["WANDB_API_KEY"] = "your-key"
```

## Resources

* [Example Training Script](https://github.com/hud-evals/hud-python/blob/main/rl/train_2048.py)
* [HUD VF Gym](https://github.com/hud-evals/hud-vf-gym) - Verifiers integration

# Config System

Source: https://docs.hud.so/train-agents/verifiers-gym

YAML-based tool configuration for training

Verifiers Gym uses YAML configs to map agent-facing tools to MCP tools, enabling flexible environment configuration without code changes.

## Basic Configuration

```yaml configs/browser.yaml
# Define agent interface
system_prompt: |
  You are an AI assistant that can interact with a browser.
  Available tools:
  - screenshot(): Take a screenshot
  - click(x, y): Click at coordinates
  - type(text): Type text
  - done(): Complete the task
  Use function(args) format.

# Map agent tools to MCP tools
action_mappings:
  screenshot:
    _tool: "computer"
    action: "screenshot"
  click:
    _parser:
      positional: ["x", "y"]  # Expected arguments
    _tool: "computer"
    action: "click"
    x: {from_arg: "x"}
    y: {from_arg: "y"}
  type:
    _parser:
      positional: ["text"]
    _tool: "computer"
    action: "type"
    text: {from_arg: "text"}
```
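To make the mapping concrete, here is roughly what happens to one agent action under this config (an illustrative sketch; the actual parsing and dispatch live in hud-vf-gym):

```python
# Agent output: click(420, 180)
# 1. _parser.positional names the arguments in order: x=420, y=180
# 2. _tool selects the MCP tool; static keys and from_arg specs fill the payload
mcp_tool_call = {
    "name": "computer",
    "arguments": {"action": "click", "x": 420, "y": 180},
}
```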
## Advanced Features

### Argument Transformation

Transform arguments before passing to MCP:

```yaml
scroll:
  _parser:
    positional: ["direction", "amount"]
  _tool: "computer"
  action: "scroll"
  coordinate: [640, 400]  # Static value
  direction:
    from_arg: "direction"
  amount:
    from_arg: "amount"
    transform: "lambda a: int(a * 120)"  # Convert to pixels
```

### Context-Aware Transforms

Transforms can also reshape a single argument into the structure the tool expects:

```yaml
key:
  _parser:
    positional: ["key"]
  _tool: "computer"
  action: "press"
  keys:
    from_arg: "key"
    transform: "lambda x: x.split('+')"  # "ctrl+a" → ["ctrl", "a"]
```

### Static Values

Use fixed values instead of arguments:

```yaml
# 2048 game directions
left:
  _tool: "move"
  direction: {static: "left"}
right:
  _tool: "move"
  direction: {static: "right"}
```

## Parser Settings

Control agent output parsing:

```yaml
parser:
  use_thinking: false    # Disable thinking tags
  xml_weight: 0.6        # XML format validation weight
  action_weight: 0.4     # Action syntax validation weight

rubric:
  weights:
    task_completion: 0.8   # Primary task reward
    tool_execution: 0.1    # Tool success rate
    format_compliance: 0.1 # Proper formatting
```

The rubric weights combine into a single scalar reward; a sketch of the weighting appears at the end of this page.

## Usage

```python
import verifiers as vf

# Use your config (path relative to where you run the script)
env = vf.load_environment(
    env_id="hud-vf-gym",
    taskset="your-org/your-taskset",
    config_path="configs/your-env.yaml"
)
```

## Examples

Config files live in `hud-python/rl/configs/`:

* `2048.yaml` - 2048 game with directional moves
* `default.yaml` - Browser/computer interaction
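For reference, the rubric weights from Parser Settings combine per-component scores into the scalar reward used for training (an illustrative sketch of the weighting, assuming component scores in [0, 1]; the real rubric lives in hud-vf-gym):

```python
weights = {"task_completion": 0.8, "tool_execution": 0.1, "format_compliance": 0.1}

def combined_reward(scores: dict[str, float]) -> float:
    """Weighted sum of per-component scores, each in [0, 1]."""
    return sum(weights[name] * scores.get(name, 0.0) for name in weights)

# Task solved, one failed tool call, clean formatting
print(combined_reward({
    "task_completion": 1.0,
    "tool_execution": 0.9,
    "format_compliance": 1.0,
}))  # 0.99
```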