HUD makes it easy to evaluate any MCP-compatible agent on a variety of tasks and benchmarks. Whether you’re testing Claude, Operator, or your custom agent, the evaluation flow is consistent.

Quick Start

Evaluate an agent on a single task (the snippets on this page use top-level await, so run them inside an async function or an async-capable notebook/REPL):
import hud
import os
from hud.datasets import Task
from hud.agents import ClaudeAgent

with hud.trace("eval-demo"):
    task = Task(
        prompt="Create a spreadsheet with quarterly sales data",
        mcp_config={
            "hud": {
                "url": "https://mcp.hud.so/v3/mcp",
                "headers": {
                    "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                    "Mcp-Image": "hudpython/hud-browser:latest"
                }
            }
        },
        setup_tool={
            "name": "playwright",
            "arguments": {
                "action": "navigate",
                "url": "https://sheets.google.com"
            }
        },
        evaluate_tool={
            "name": "evaluate", 
            "arguments": {
                "name": "sheet_contains",
                "text": "Q1 Sales"
            }
        }
    )
    
    agent = ClaudeAgent()
    result = await agent.run(task)
    print(f"Success: {result.reward > 0.5}")

What You Can Do

1. Create Tasks for Browser Automation

import os

from hud.datasets import Task

task = Task(
    prompt="Navigate to GitHub and star the HUD repository",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },
    setup_tool={
        "name": "setup",
        "arguments": {
            "name": "navigate",
            "url": "https://github.com"
        }
    },
    evaluate_tool={
        "name": "evaluate", 
        "arguments": {
            "name": "url_contains",
            "substring": "hud-python"
        }
    }
)
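
Run this task the same way as in the Quick Start, for example with the built-in ClaudeAgent:
from hud.agents import ClaudeAgent

agent = ClaudeAgent()
result = await agent.run(task)
print(f"Success: {result.reward > 0.5}")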

2. Run Existing Benchmarks

from datasets import load_dataset
from hud.agents import ClaudeAgent
from hud.datasets import run_dataset

# Load and run SheetBench-50
dataset = load_dataset("hud-evals/sheetbench-50", split="train")
results = await run_dataset(
    "My SheetBench Run",
    dataset,
    agent_class=ClaudeAgent,
    max_concurrent=5
)
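
To sanity-check a run locally, you can summarize the results. This is a rough sketch that assumes each entry in results exposes a numeric reward, like the single-task result in the Quick Start:
# Assumption: each result carries a `reward` score, as in the Quick Start example
solved = sum(1 for r in results if r.reward > 0.5)
print(f"Solved {solved}/{len(results)} tasks")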

3. Make Your Agent Work with HUD

To create a custom agent, inherit from MCPAgent and implement the required methods:
from typing import Any

from hud.agents import MCPAgent
from hud.types import AgentResponse, MCPToolCall

class MyCustomAgent(MCPAgent):
    async def get_response(self, messages: list[Any]) -> AgentResponse:
        # Call your LLM and return tool calls
        ...
    
    async def format_blocks(self, blocks: list[Any]) -> list[Any]:
        # Format content blocks into messages for your LLM
        ...
    
    async def format_tool_results(
        self, tool_calls: list[MCPToolCall], 
        tool_results: list[Any]
    ) -> list[Any]:
        # Format tool results back into messages
        ...

# Now it works with any HUD dataset!
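
Once the three methods are implemented, the agent plugs into the same helpers shown above. A minimal sketch, reusing the run_dataset call from section 2 with the illustrative MyCustomAgent class:
from datasets import load_dataset
from hud.datasets import run_dataset

dataset = load_dataset("hud-evals/sheetbench-50", split="train")
results = await run_dataset(
    "MyCustomAgent on SheetBench",
    dataset,
    agent_class=MyCustomAgent,
    max_concurrent=5
)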

Full Implementation Guide

See complete examples and implementation details

Available Benchmarks

SheetBench-50

50 real-world spreadsheet tasks testing data manipulation, formulas, and analysis

OSWorld-Verified

Desktop automation tasks across Ubuntu applications

OnlineMind2Web

Web navigation and interaction challenges (Coming Soon)

2048 Puzzles

Strategic planning in the 2048 game environment

Exploring Environments

Use hud analyze to discover available tools and evaluators:
$ hud analyze hudpython/hud-browser:latest

πŸ” Analyzing hudpython/hud-browser:latest...

πŸ“Š Environment Summary:
β”œβ”€β”€ Tools: 15 available
β”œβ”€β”€ Setup Functions: 8 available  
β”œβ”€β”€ Evaluators: 12 available
└── Resources: 3 available

πŸ› οΈ Tools:
β”œβ”€β”€ playwright(action: str, **kwargs) - Browser automation actions
β”œβ”€β”€ click(selector: str) - Click element
β”œβ”€β”€ type(selector: str, text: str) - Type text
└── ... 12 more tools

πŸ“‹ Evaluators:
β”œβ”€β”€ url_contains(substring: str) - Check if URL contains text
β”œβ”€β”€ page_contains(text: str, regex: bool = False) - Check page content
β”œβ”€β”€ element_exists(selector: str) - Check if CSS selector exists
β”œβ”€β”€ todo_completed(expected_count: int) - Verify TODO completion
└── ... 8 more evaluators

Run with --json for full details or pipe to grep for filtering.
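
The tool and evaluator names that hud analyze prints map directly onto a task's setup_tool and evaluate_tool fields. A hypothetical example using the page_contains evaluator listed above (the task itself is illustrative, not part of a shipped benchmark):
import os
from hud.datasets import Task

task = Task(
    prompt="Open example.com and confirm the page loaded",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },
    setup_tool={
        "name": "playwright",
        "arguments": {
            "action": "navigate",
            "url": "https://example.com"
        }
    },
    evaluate_tool={
        "name": "evaluate",
        "arguments": {
            "name": "page_contains",
            "text": "Example Domain"
        }
    }
)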

Publishing to Leaderboards

After running evaluations, view results on the leaderboard:
# Run evaluation
results = await run_dataset(
    "Claude-3.5 SheetBench",
    dataset="hud-evals/sheetbench-50",
    agent_class=ClaudeAgent
)

# Then visit: app.hud.so/leaderboards/hud-evals/sheetbench-50
# Click "My Jobs" to see your runs and create scorecards

Key Features

  • Reproducible: Docker environments ensure consistency
  • Parallel: Run multiple evaluations concurrently
  • Observable: Every tool call tracked with telemetry
  • Extensible: Easy to add new tasks or benchmarks

Next Steps