Tasks format

HUD tasksets can be provided in either of two supported formats:
  1. A single JSON file containing a list of task objects (recommended)
[
  {
    "id": "browser_2048_128",
    "prompt": "Reach 128 in 2048.",
    "mcp_config": {
      "hud": {
        "url": "https://mcp.hud.so/v3/mcp",
        "headers": {
          "Authorization": "Bearer ${HUD_API_KEY}",
          "Mcp-Image": "hudevals/hud-browser:0.1.3"
        }
      }
    },
    "setup_tool": {"name": "launch_app", "arguments": {"app_name": "2048"}},
    "evaluate_tool": {"name": "evaluate", "arguments": {"name": "game_2048_max_number", "arguments": {"target": 128}}}
  }
]
Save as basic-2048.json and run:
hud eval basic-2048.json
hud rl basic-2048.json
  2. A JSONL file with one task object per line

In both formats, each task object uses the following fields:
  • prompt: instruction for the agent
  • mcp_config: where to run the environment (local Docker or remote MCP)
  • setup_tool (optional): a tool call to prepare the environment
  • evaluate_tool: a tool call to compute the reward
  • system_prompt (optional): extra guidance for the agent
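
As a quick sanity check before running a taskset, the field list above can be turned into a small validator. This is only a sketch: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, not part of the hud CLI, and it assumes that every field not marked "(optional)" above is required.

```python
# Assumption: fields not marked "(optional)" in the list above are required.
REQUIRED_FIELDS = ("prompt", "mcp_config", "evaluate_tool")


def missing_fields(task: dict) -> list[str]:
    """Return the required field names that are absent from a task object."""
    return [name for name in REQUIRED_FIELDS if name not in task]


# A task shaped like the example in this section (headers omitted for brevity).
task = {
    "id": "browser_2048_128",
    "prompt": "Reach 128 in 2048.",
    "mcp_config": {"hud": {"url": "https://mcp.hud.so/v3/mcp"}},
    "setup_tool": {"name": "launch_app", "arguments": {"app_name": "2048"}},
    "evaluate_tool": {
        "name": "evaluate",
        "arguments": {"name": "game_2048_max_number", "arguments": {"target": 128}},
    },
}

problems = missing_fields(task)
print(problems or "all required fields present")
```

An empty result only means the keys are present; it says nothing about whether the tool names exist in the environment.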

Minimal JSONL example

{"id": "browser_2048_128", "prompt": "Reach 128 in 2048.", "mcp_config": {"hud": {"url": "https://mcp.hud.so/v3/mcp", "headers": {"Authorization": "Bearer ${HUD_API_KEY}", "Mcp-Image": "hudevals/hud-browser:0.1.3"}}}, "setup_tool": {"name": "launch_app", "arguments": {"app_name": "2048"}}, "evaluate_tool": {"name": "evaluate", "arguments": {"name": "game_2048_max_number", "arguments": {"target": 128}}}}
Save as basic-2048.jsonl and run:
hud eval basic-2048.jsonl
hud rl basic-2048.jsonl
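
Because the two formats carry the same task objects, a JSON list can be converted to JSONL with the standard library alone. The sketch below is illustrative; `json_list_to_jsonl` and `jsonl_to_tasks` are hypothetical helper names, not hud commands.

```python
import json


def json_list_to_jsonl(json_text: str) -> str:
    """Convert a JSON array of task objects into JSONL (one object per line)."""
    tasks = json.loads(json_text)
    if not isinstance(tasks, list):
        raise ValueError("expected a JSON list of task objects")
    # json.dumps without indent emits no raw newlines, so each task stays on one line.
    return "".join(json.dumps(task) + "\n" for task in tasks)


def jsonl_to_tasks(jsonl_text: str) -> list[dict]:
    """Parse JSONL back into a list of task objects, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.split("\n") if line.strip()]


# File-based usage, assuming basic-2048.json exists in the working directory:
#   from pathlib import Path
#   Path("basic-2048.jsonl").write_text(
#       json_list_to_jsonl(Path("basic-2048.json").read_text())
#   )
```

Note that the `${HUD_API_KEY}` placeholder is an ordinary string as far as JSON is concerned, so it passes through the conversion unchanged.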

Hosting on Hugging Face

You can host tasksets on the Hugging Face Hub and fetch them with:
hud get hud-evals/basic-2048
The command downloads the JSONL task file and places it in your project directory.

Tips

  • Keep tasks self-contained; use setup_tool to open apps or load data
  • Ensure evaluate_tool returns a numeric reward per episode
  • Use small task counts to iterate quickly; scale up once stable
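
To iterate on a small task count as suggested above, one option is to cut a short subset file from a larger JSONL taskset. A minimal sketch, assuming the taskset is JSONL; `first_n_tasks` is a hypothetical helper, not a hud feature.

```python
def first_n_tasks(jsonl_text: str, n: int) -> str:
    """Keep only the first n non-blank task lines of a JSONL taskset."""
    lines = [line for line in jsonl_text.split("\n") if line.strip()]
    return "".join(line + "\n" for line in lines[:n])


# File-based usage, assuming basic-2048.jsonl exists in the working directory:
#   from pathlib import Path
#   subset = first_n_tasks(Path("basic-2048.jsonl").read_text(), 5)
#   Path("basic-2048-small.jsonl").write_text(subset)
```

The subset file can then be passed to hud eval in place of the full taskset.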