The hud eval command runs an agent on a tasks file or a HuggingFace dataset.
Usage
Arguments
HuggingFace dataset (e.g., hud-evals/SheetBench-50) or task JSON/JSONL file. If omitted, looks for a tasks file in the current directory.
Agent backend to use: claude, openai, or vllm. If omitted, an interactive selector appears (including HUD hosted models).
Options
Run the entire dataset (omit for single-task debug mode)
Model name for the chosen agent (required for some agents)
Comma-separated list of allowed tools
Max concurrent tasks
Maximum steps per task (default varies by mode)
Use process-based parallel execution for large datasets (100+ tasks)
Number of worker processes for parallel mode (auto-optimized if not set)
Maximum concurrent tasks per worker in parallel mode
Enable verbose agent output
Base URL for vLLM server (when using --agent vllm or HUD hosted models)
Number of times to run each task (mini-batch style)
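To make the interaction between the parallel-mode options more concrete, the sketch below works through the arithmetic. It assumes that the worker count and the per-worker limit multiply to give an upper bound on tasks in flight, and that running each task several times multiplies the total number of runs. Neither assumption is confirmed by the descriptions above, and `plan_runs`, `RunPlan`, and the parameter names are hypothetical, not part of the hud CLI.

```python
from dataclasses import dataclass


@dataclass
class RunPlan:
    total_runs: int     # task runs to execute overall
    max_in_flight: int  # assumed upper bound on simultaneous runs


def plan_runs(num_tasks: int, runs_per_task: int,
              workers: int, max_concurrent_per_worker: int) -> RunPlan:
    """Hypothetical helper: estimate the load of a parallel evaluation."""
    if min(num_tasks, runs_per_task, workers, max_concurrent_per_worker) < 1:
        raise ValueError("all inputs must be positive integers")
    total_runs = num_tasks * runs_per_task
    # Assumption: each worker process enforces its own concurrency cap.
    ceiling = workers * max_concurrent_per_worker
    # There can never be more runs in flight than runs in total.
    return RunPlan(total_runs, min(total_runs, ceiling))


plan = plan_runs(num_tasks=200, runs_per_task=3,
                 workers=4, max_concurrent_per_worker=25)
print(plan)  # RunPlan(total_runs=600, max_in_flight=100)
```

An estimate like this is mainly useful for checking that the resulting concurrency stays within the rate limits of the model provider before launching a large run.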
Examples
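As a minimal sketch of a scripted invocation, the helper below assembles an argument list for hud eval. It assumes the source is passed positionally and the agent through `--agent`, the only flag spelling that appears in the option descriptions above; the installed version may expect something different, so check the command's built-in help first. `build_eval_command` and `KNOWN_AGENTS` are hypothetical names introduced for illustration.

```python
import shlex
from typing import List, Optional

# Agent backends named in the Arguments section.
KNOWN_AGENTS = ("claude", "openai", "vllm")


def build_eval_command(source: Optional[str] = None,
                       agent: Optional[str] = None) -> List[str]:
    """Hypothetical helper: build an argv list for `hud eval`."""
    argv = ["hud", "eval"]
    if source is not None:
        # Assumption: SOURCE is a positional argument.
        argv.append(source)
    if agent is not None:
        if agent not in KNOWN_AGENTS:
            raise ValueError(f"unknown agent: {agent!r}")
        # Assumption: the agent is selected with --agent.
        argv += ["--agent", agent]
    return argv


cmd = build_eval_command("hud-evals/SheetBench-50", "claude")
print(shlex.join(cmd))  # hud eval hud-evals/SheetBench-50 --agent claude
# To execute it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when the source is a file path containing spaces.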
Notes
- If you select a HUD hosted model, hud eval will route through vLLM with the appropriate base model.
- When SOURCE is omitted, an interactive file picker helps locate a tasks file.
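To see which files a picker of this kind might offer, the sketch below lists the JSON and JSONL files in a directory. It is only an illustration: how hud eval actually discovers candidate files is not described here, and `find_task_files` is a hypothetical helper, not part of the CLI.

```python
from pathlib import Path
from typing import List

# Extensions matching the "task JSON/JSONL file" wording in Arguments.
TASK_SUFFIXES = {".json", ".jsonl"}


def find_task_files(directory: str = ".") -> List[Path]:
    """Hypothetical helper: list JSON/JSONL files in a directory, sorted by path."""
    root = Path(directory)
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in TASK_SUFFIXES
    )


# Print the candidates found in the current directory.
for path in find_task_files():
    print(path)
```

Passing the chosen path explicitly as SOURCE skips the interactive step, which matters in scripts and CI jobs where no terminal is attached.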