Example: Benchmarking Agents

This guide demonstrates how to use HUD to benchmark and compare the performance of different AI agents on a standardized set of tasks. This is crucial for understanding agent capabilities and identifying areas for improvement.

Goal: Evaluate two different agents (e.g., ClaudeAgent and OperatorAgent) on the hud-samples TaskSet and compare their results.

Concepts Covered:

  • Loading pre-built TaskSet objects.
  • Configuring and running different Agent implementations.
  • Using run_job() for batch evaluation.
  • Accessing and interpreting Job analytics.
  • Viewing detailed results and trajectories on the HUD platform.

Prerequisites

  • HUD SDK installed (Installation Guide).
  • API keys for the agents you want to benchmark (e.g., ANTHROPIC_API_KEY for Claude, OPENAI_API_KEY for Operator) set in your environment or .env file.
  • HUD_API_KEY set for uploading results to the HUD platform (a quick pre-flight check is sketched after this list).
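
Before running anything, it can help to confirm these keys are actually visible to your process. Below is a minimal, optional sketch; it assumes you keep keys in a .env file loaded with the python-dotenv package, and the key names simply mirror the list above.

import os
from dotenv import load_dotenv  # optional dependency; only needed if you keep keys in a .env file

load_dotenv()  # makes .env entries available via os.getenv

required_keys = ["HUD_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY"]
missing = [key for key in required_keys if not os.getenv(key)]
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")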

Step 1: Load a Standard TaskSet

We’ll use the hud-samples TaskSet, which contains 10 tasks suitable for general agent evaluation. You can swap in other TaskSets, such as WebVoyager or GAIA, for more targeted benchmarks (illustrative alternatives are noted after the code below).

import asyncio
from hud import load_taskset, run_job, ClaudeAgent, OperatorAgent

async def run_benchmarks():
    print("Loading hud-samples TaskSet...")
    taskset = await load_taskset("hud-samples")
    if not taskset.tasks:
        print("TaskSet is empty. Please ensure 'hud-samples' is a valid and populated TaskSet.")
        return
    print(f"Loaded TaskSet with {len(taskset.tasks)} tasks.")

Step 2: Configure Your Agents

Initialize the agents you want to compare. For this example, we’ll use ClaudeAgent and OperatorAgent (which requires an OpenAI API key).

# In the same async def run_benchmarks():

    # Agent 1: ClaudeAgent
    claude_agent = ClaudeAgent()
    # You can customize agent parameters if needed, e.g.:
    # claude_agent_custom = ClaudeAgent(model="claude-3-opus-20240229")

    # Agent 2: OperatorAgent (OpenAI)
    # Make sure OPENAI_API_KEY is set and that your OpenAI organization has access to the Operator model
    try:
        operator_agent = OperatorAgent()
    except ValueError as e:
        print(f"Could not initialize OperatorAgent: {e}. Skipping OperatorAgent benchmark.")
        operator_agent = None

    agents_to_benchmark = []
    agents_to_benchmark.append({"name": "ClaudeAgent", "instance": claude_agent})
    if operator_agent:
        agents_to_benchmark.append({"name": "OperatorAgent", "instance": operator_agent})

    if not agents_to_benchmark:
        print("No agents available for benchmarking.")
        return

Step 3: Run Evaluations for Each Agent

Use run_job() to execute the TaskSet for each agent. Each run will create a separate Job on the HUD platform.

# In the same async def run_benchmarks():

    job_results = {}

    for agent_info in agents_to_benchmark:
        agent_name = agent_info["name"]
        agent_instance = agent_info["instance"]
        
        print(f"\nRunning benchmark for: {agent_name}...")
        job_instance = await run_job(
            agent_cls=type(agent_instance), # Pass the agent class, not the instance, so each parallelized run gets its own instance
            agent_kwargs={}, # Init kwargs for this agent, if any; otherwise the defaults are used
            task_or_taskset=taskset,
            job_name=f"{agent_name} on hud-samples Benchmark",
            # adapter_cls can be specified if you're not using the agent's default
        )
        print(f"Job for {agent_name} completed. Job ID: {job_instance.id}")
        print(f"View results at: https://app.hud.so/jobs/{job_instance.id}")
        job_results[agent_name] = job_instance

When passing agent_cls to run_job, you provide the class itself (e.g., ClaudeAgent, not claude_agent). If your agent was initialized with non-default parameters (e.g., a specific model), pass those same parameters to run_job via agent_kwargs. For agents like OperatorAgent whose __init__ takes an environment argument, that also goes into agent_kwargs when you are not using the default (e.g., agent_kwargs={"environment": "browser"}).
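
For example, here is a sketch of benchmarking a customized Claude configuration inside the same run_benchmarks() coroutine; the model string mirrors the comment in Step 2, and you should adjust it to whatever models your SDK version and API access support.

    # Sketch: a non-default agent configuration passed via agent_kwargs.
    custom_job = await run_job(
        agent_cls=ClaudeAgent,
        agent_kwargs={"model": "claude-3-opus-20240229"},  # forwarded to ClaudeAgent.__init__ for each parallel instance
        task_or_taskset=taskset,
        job_name="ClaudeAgent (custom model) on hud-samples Benchmark",
    )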

Step 4: Compare Analytics

After the jobs are complete, you can fetch and compare their summary analytics.

# In the same async def run_benchmarks():

    print("\n--- Benchmark Summary ---")
    for agent_name, job_instance in job_results.items():
        try:
            analytics = await job_instance.get_analytics()
            print(f"\nResults for {agent_name} (Job ID: {job_instance.id}):")
            print(f"  Task Count: {analytics.get('task_count')}")
            print(f"  Success Rate: {analytics.get('success_rate'):.2f}%")
            print(f"  Average Reward: {analytics.get('avg_reward')}") # Might be None if not applicable
        except Exception as e:
            print(f"Could not fetch analytics for {agent_name} (Job ID: {job_instance.id}): {e}")

    print("\nDetailed trajectories and comparisons available at https://app.hud.so/jobs")

# To run the benchmark:
# if __name__ == "__main__":
#     asyncio.run(run_benchmarks())

Full Code Example

import asyncio
import logging
from hud import load_taskset, run_job, ClaudeAgent, OperatorAgent # Make sure agents are imported

# Optional: Configure logging to see HUD SDK logs
logging.basicConfig(level=logging.INFO)
logging.getLogger("hud").setLevel(logging.DEBUG) # For more detailed HUD logs

async def run_benchmarks():
    print("Loading hud-samples TaskSet...")
    try:
        taskset = await load_taskset("hud-samples")
    except Exception as e:
        print(f"Failed to load TaskSet 'hud-samples': {e}")
        return

    if not taskset or not taskset.tasks:
        print("TaskSet 'hud-samples' is empty or could not be loaded.")
        return
    print(f"Loaded TaskSet '{taskset.name if taskset.name else taskset.id}' with {len(taskset.tasks)} tasks.")

    # Agent 1: ClaudeAgent
    claude_agent_instance = ClaudeAgent()

    # Agent 2: OperatorAgent (OpenAI)
    operator_agent_instance = None
    try:
        # OperatorAgent often requires an environment type for its tool configuration
        operator_agent_instance = OperatorAgent(environment="browser") 
    except ValueError as e:
        print(f"NOTICE: Could not initialize OperatorAgent (OpenAI API key might be missing or invalid): {e}. Skipping OperatorAgent benchmark.")
    except Exception as e:
        print(f"NOTICE: An unexpected error occurred while initializing OperatorAgent: {e}. Skipping.")

    agents_to_benchmark = []
    agents_to_benchmark.append({"name": "ClaudeAgent", "instance": claude_agent_instance, "init_kwargs": {}})
    if operator_agent_instance:
        # For OperatorAgent, if it needs specific init args like 'environment', pass them via agent_kwargs to run_job
        agents_to_benchmark.append({"name": "OperatorAgent", "instance": operator_agent_instance, "init_kwargs": {"environment": "browser"}})

    if not agents_to_benchmark:
        print("No agents available for benchmarking.")
        return

    job_results = {}

    for agent_info in agents_to_benchmark:
        agent_name = agent_info["name"]
        agent_instance = agent_info["instance"]
        agent_init_kwargs = agent_info["init_kwargs"]

        print(f"\nRunning benchmark for: {agent_name}...")
        try:
            job_instance = await run_job(
                agent_cls=type(agent_instance),
                agent_kwargs=agent_init_kwargs, # Pass agent-specific init args here
                task_or_taskset=taskset,
                job_name=f"{agent_name} on {taskset.name if taskset.name else taskset.id} Benchmark",
            )
            print(f"Job for {agent_name} completed. Job ID: {job_instance.id}")
            print(f"View results at: https://app.hud.so/jobs/{job_instance.id}")
            job_results[agent_name] = job_instance
        except Exception as e:
            print(f"Failed to run job for {agent_name}: {e}")
            job_results[agent_name] = None # Mark as failed or skip

    print("\n--- Benchmark Summary ---")
    for agent_name, job_instance in job_results.items():
        if not job_instance:
            print(f"\nResults for {agent_name}: Job execution failed.")
            continue
        try:
            analytics = await job_instance.get_analytics()
            print(f"\nResults for {agent_name} (Job ID: {job_instance.id}):")
            print(f"  Task Count: {analytics.get('task_count')}")
            success_rate = analytics.get('success_rate')
            avg_reward = analytics.get('avg_reward')
            print(f"  Success Rate: {success_rate:.2f}%" if success_rate is not None else "  Success Rate: N/A")
            print(f"  Average Reward: {avg_reward}" if avg_reward is not None else "  Average Reward: N/A")
        except Exception as e:
            print(f"Could not fetch analytics for {agent_name} (Job ID: {job_instance.id}): {e}")

    print("\nDetailed trajectories and comparisons for successful jobs available at https://app.hud.so/jobs")

if __name__ == "__main__":
    # Ensure HUD_API_KEY and agent-specific keys (e.g., ANTHROPIC_API_KEY) are set
    # in your environment or .env file.
    asyncio.run(run_benchmarks())

Interpreting Results

  • HUD Platform: The most comprehensive way to compare agents is by viewing their jobs on app.hud.so. You can see side-by-side video recordings of trajectories, detailed action logs, and evaluation results for each task.
  • job.get_analytics(): Provides a quick summary of performance, including success_rate and avg_reward (if applicable based on your evaluation functions).
  • Custom Analysis: You can go deeper by loading trajectories (await job_instance.load_trajectories()) and inspecting raw data such as action sequences, errors, and timing for each task; a sketch follows this list.
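
A minimal sketch of such custom analysis is shown below. The load_trajectories() call comes from the bullet above, but the attribute names read from each trajectory (reward, error) are assumptions, so inspect one of the returned objects in your SDK version before relying on them.

# Sketch: rough per-task breakdown from a completed job (run this inside an async
# context, e.g. the run_benchmarks() coroutine above).
trajectories = await job_instance.load_trajectories()
for idx, trajectory in enumerate(trajectories):
    reward = getattr(trajectory, "reward", None)  # assumed attribute name
    error = getattr(trajectory, "error", None)    # assumed attribute name
    print(f"Task {idx}: reward={reward}, error={error}")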

Next Steps

  • Explore other TaskSets: Try benchmarks like WebVoyager, Mind2Web, or GAIA for more targeted evaluations.
  • Customize Agents: Experiment with different models, system prompts, or adapter settings for your agents.
  • Create Your Own TaskSets: Design specific evaluations tailored to your agent’s capabilities using the Task Creation Guide.