Test and benchmark your agents on standardized tasks
HUD makes it easy to evaluate any MCP-compatible agent on a variety of tasks and benchmarks. Whether you're testing Claude, Operator, or your custom agent, the evaluation flow is consistent.
To create a custom agent, inherit from MCPAgent and implement the required methods:
```python
from typing import Any

from hud.agents import MCPAgent
from hud.types import AgentResponse, MCPToolCall


class MyCustomAgent(MCPAgent):
    async def get_response(self, messages: list[Any]) -> AgentResponse:
        # Call your LLM and return tool calls
        ...

    async def format_blocks(self, blocks: list[Any]) -> list[Any]:
        # Format content blocks into messages for your LLM
        ...

    async def format_tool_results(
        self, tool_calls: list[MCPToolCall], tool_results: list[Any]
    ) -> list[Any]:
        # Format tool results back into messages
        ...


# Now it works with any HUD dataset!
```
After running evaluations, view results on the leaderboard:
```python
# Run evaluation
results = await run_dataset(
    "Claude-3.5 SheetBench",
    dataset="hud-evals/sheetbench-50",
    agent_class=ClaudeAgent,
)

# Then visit: app.hud.so/leaderboards/hud-evals/sheetbench-50
# Click "My Jobs" to see your runs and create scorecards
```
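The custom agent defined above drops into the same flow: pass it as `agent_class`. Below is a minimal sketch, assuming `run_dataset` is importable from `hud.datasets` and that `MyCustomAgent` is the class from the earlier example; the run name is just an illustrative label.

```python
from hud.datasets import run_dataset  # assumed import path

# Evaluate the custom agent from the earlier example on the same benchmark
results = await run_dataset(
    "MyCustomAgent SheetBench",         # illustrative run name
    dataset="hud-evals/sheetbench-50",
    agent_class=MyCustomAgent,          # the MCPAgent subclass defined above
)
```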