The hud eval command runs an agent on a tasks file or a HuggingFace dataset.

Usage

hud eval [SOURCE] [AGENT] [OPTIONS]

Arguments

source
string
HuggingFace dataset (e.g., hud-evals/SheetBench-50) or a task JSON/JSONL file. If omitted, hud eval looks for a tasks file in the current directory.
agent
string
Agent backend to use: claude, openai, or vllm. If omitted, an interactive selector appears (including HUD hosted models).

Options

--full
boolean
default:"false"
Run the entire dataset (omit for single-task debug mode)
--model
string
Model name for the chosen agent (required for some agents)
--allowed-tools
string
Comma-separated list of allowed tools
--max-concurrent
integer
default:"50"
Max concurrent tasks
--max-steps
integer
default:"30"
Maximum steps per task (default varies by mode)
--parallel
boolean
default:"false"
Use process-based parallel execution for large datasets (100+ tasks)
--max-workers
integer
Number of worker processes for parallel mode (determined automatically if not set)
--max-concurrent-per-worker
integer
default:"20"
Maximum concurrent tasks per worker in parallel mode
--verbose
boolean
default:"false"
Enable verbose agent output
--vllm-base-url
string
Base URL for the vLLM server (when using the vllm agent or HUD hosted models)
--group-size
integer
default:"1"
Number of times to run each task (mini-batch style)

Examples

# Minimal (interactive agent selection)
hud eval tasks.json
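
# Fully interactive (tasks file picker and agent selector)
hud eval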

# Entire dataset with Claude
hud eval hud-evals/SheetBench-50 claude --full
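
# OpenAI agent with an explicit model (model name is illustrative)
hud eval tasks.json openai --model gpt-4o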

# vLLM with explicit base URL
hud eval tasks.json vllm --model llama3.1 --vllm-base-url http://localhost:8000

# Limit tools and concurrency
hud eval tasks.json claude --allowed-tools click,type --max-concurrent 10
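
# Process-based parallel run over a large dataset (worker count is illustrative)
hud eval hud-evals/SheetBench-50 claude --full --parallel --max-workers 4

# Run each task 3 times (mini-batch style) with verbose output
hud eval tasks.json claude --group-size 3 --verbose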

Notes

  • If you select a HUD hosted model, hud eval will route through vLLM with the appropriate base model.
  • When SOURCE is omitted, an interactive file picker helps locate a tasks file.

See Also