HUD makes it easy to evaluate any MCP-compatible agent on a variety of tasks and benchmarks. Whether you’re testing Claude, Operator, or your custom agent, the evaluation flow is consistent.

Quick Start

Evaluate an agent on a single task (the snippets on this page use top-level await, so run them inside an async function or an async-capable notebook/REPL):
import hud
import os
from hud.datasets import Task
from hud.agents import ClaudeAgent

with hud.trace("eval-demo"):
    task = Task(
        prompt="Create a spreadsheet with quarterly sales data",
        mcp_config={
            "hud": {
                "url": "https://mcp.hud.so/v3/mcp",
                "headers": {
                    "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                    "Mcp-Image": "hudpython/hud-browser:latest"
                }
            }
        },
        setup_tool={
            "name": "playwright",
            "arguments": {
                "action": "navigate",
                "url": "https://sheets.google.com"
            }
        },
        evaluate_tool={
            "name": "evaluate", 
            "arguments": {
                "name": "sheet_contains",
                "text": "Q1 Sales"
            }
        }
    )
    
    agent = ClaudeAgent()
    result = await agent.run(task)
    print(f"Success: {result.reward > 0.5}")

What You Can Do

1. Create Tasks for Browser Automation

import os

from hud.datasets import Task

task = Task(
    prompt="Navigate to GitHub and star the HUD repository",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },
    setup_tool={
        "name": "setup",
        "arguments": {
            "name": "navigate",
            "url": "https://github.com"
        }
    },
    evaluate_tool={
        "name": "evaluate", 
        "arguments": {
            "name": "url_contains",
            "substring": "hud-python"
        }
    }
)
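
Run this task the same way as in the Quick Start, for example with the built-in ClaudeAgent:
from hud.agents import ClaudeAgent

agent = ClaudeAgent()
result = await agent.run(task)
print(f"Success: {result.reward > 0.5}")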

2. Run Existing Benchmarks

from datasets import load_dataset
from hud.agents import ClaudeAgent
from hud.datasets import run_dataset

# Load and run SheetBench-50
dataset = load_dataset("hud-evals/sheetbench-50", split="train")
results = await run_dataset(
    "My SheetBench Run",
    dataset,
    agent_class=ClaudeAgent,
    max_concurrent=5
)
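
To sanity-check a run locally, you can summarize the results. This is a rough sketch that assumes each entry in results exposes a numeric reward, like the single-task result in the Quick Start:
# Assumption: each result carries a `reward` score, as in the Quick Start example
solved = sum(1 for r in results if r.reward > 0.5)
print(f"Solved {solved}/{len(results)} tasks")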

3. Make Your Agent Work with HUD

To create a custom agent, inherit from MCPAgent and implement the required methods:
from typing import Any

from hud.agents import MCPAgent
from hud.types import AgentResponse, MCPToolCall

class MyCustomAgent(MCPAgent):
    async def get_response(self, messages: list[Any]) -> AgentResponse:
        # Call your LLM and return tool calls
        ...
    
    async def format_blocks(self, blocks: list[Any]) -> list[Any]:
        # Format content blocks into messages for your LLM
        ...
    
    async def format_tool_results(
        self, tool_calls: list[MCPToolCall], 
        tool_results: list[Any]
    ) -> list[Any]:
        # Format tool results back into messages
        ...

# Now it works with any HUD dataset!
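
Once the three methods are implemented, the agent plugs into the same helpers shown above. A minimal sketch, reusing the run_dataset call from section 2 with the illustrative MyCustomAgent class:
from datasets import load_dataset
from hud.datasets import run_dataset

dataset = load_dataset("hud-evals/sheetbench-50", split="train")
results = await run_dataset(
    "MyCustomAgent on SheetBench",
    dataset,
    agent_class=MyCustomAgent,
    max_concurrent=5
)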

Full Implementation Guide

See complete examples and implementation details

Available Benchmarks

SheetBench-50

50 real-world spreadsheet tasks testing data manipulation, formulas, and analysis

OSWorld-Verified

Desktop automation tasks across Ubuntu applications

OnlineMind2Web

Web navigation and interaction challenges (Coming Soon)

2048 Puzzles

Strategic planning in the 2048 game environment

Exploring Environments

Use hud analyze to discover available tools and evaluators:
$ hud analyze hudpython/hud-browser:latest

πŸ” Analyzing hudpython/hud-browser:latest...

πŸ“Š Environment Summary:
β”œβ”€β”€ Tools: 15 available
β”œβ”€β”€ Setup Functions: 8 available  
β”œβ”€β”€ Evaluators: 12 available
└── Resources: 3 available

πŸ› οΈ Tools:
β”œβ”€β”€ playwright(action: str, **kwargs) - Browser automation actions
β”œβ”€β”€ click(selector: str) - Click element
β”œβ”€β”€ type(selector: str, text: str) - Type text
└── ... 12 more tools

πŸ“‹ Evaluators:
β”œβ”€β”€ url_contains(substring: str) - Check if URL contains text
β”œβ”€β”€ page_contains(text: str, regex: bool = False) - Check page content
β”œβ”€β”€ element_exists(selector: str) - Check if CSS selector exists
β”œβ”€β”€ todo_completed(expected_count: int) - Verify TODO completion
└── ... 8 more evaluators

Run with --json for full details or pipe to grep for filtering.
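
The tool and evaluator names that hud analyze prints map directly onto a task's setup_tool and evaluate_tool fields. A hypothetical example using the page_contains evaluator listed above (the task itself is illustrative, not part of a shipped benchmark):
import os
from hud.datasets import Task

task = Task(
    prompt="Open example.com and confirm the page loaded",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },
    setup_tool={
        "name": "playwright",
        "arguments": {
            "action": "navigate",
            "url": "https://example.com"
        }
    },
    evaluate_tool={
        "name": "evaluate",
        "arguments": {
            "name": "page_contains",
            "text": "Example Domain"
        }
    }
)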

Publishing to Leaderboards

After running evaluations, view results on the leaderboard:
# Run evaluation
results = await run_dataset(
    "Claude-3.5 SheetBench",
    dataset="hud-evals/sheetbench-50",
    agent_class=ClaudeAgent
)

# Then visit: app.hud.so/leaderboards/hud-evals/sheetbench-50
# Click "My Jobs" to see your runs and create scorecards

Key Features

  • Reproducible: Docker environments ensure consistency
  • Parallel: Run multiple evaluations concurrently
  • Observable: Every tool call tracked with telemetry
  • Extensible: Easy to add new tasks or benchmarks

Next Steps