Train agents to perform better on specific tasks using GRPO (Group Relative Policy Optimization).

Quick Start

Prerequisites

  • GPU with 24GB+ VRAM (A100, RTX 4090)
  • Python 3.12+
  • HUD API key (set HUD_API_KEY environment variable)

Installation

# Clone repository
git clone https://github.com/hud-evals/hud-python
cd hud-python/rl

# Clone hud-vf-gym (required)
git clone https://github.com/hud-evals/hud-vf-gym.git

# Install with uv (recommended)
uv sync
source .venv/bin/activate

Train 2048 Agent

1. Build Docker Image

# Set API key
export HUD_API_KEY="YOUR_API_KEY"

# Build the 2048 environment
cd ../environments/text_2048/
docker build -t hud-text-2048 .
cd ../../rl/

2. Start vLLM Server

# Terminal 1 - GPU 0
CUDA_VISIBLE_DEVICES=0 vf-vllm \
  --model Qwen/Qwen2.5-3B-Instruct \
  --enforce-eager \
  --disable-log-requests
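
Before launching training, you can confirm the server is up and serving the model. This assumes vf-vllm exposes the usual OpenAI-compatible API on the default port 8000; adjust the URL if your setup differs:
# Quick readiness check from another shell
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))  # should list Qwen/Qwen2.5-3B-Instruct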

3. Run Training

# Terminal 2 - GPU 1
CUDA_VISIBLE_DEVICES=1 python train_2048.py

The included train_2048.py script trains a 3B model on the 2048 game. Training takes ~30 minutes for 100 steps.

[Figure: RL training progress: Qwen2.5-3B agent training on the text-2048 environment using GRPO]

How It Works

GRPO trains by comparing multiple attempts at the same task: for each task, the trainer samples a group of completions, scores each one with the environment's reward, and reinforces the completions that score above the group average.
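
A minimal sketch of the group-relative advantage computation, for intuition only (this is not the verifiers internals, just the core idea):
import torch

def group_relative_advantages(rewards: torch.Tensor, num_generations: int, eps: float = 1e-4) -> torch.Tensor:
    """Normalize each reward against its own group of generations."""
    groups = rewards.view(-1, num_generations)       # one row per task
    mean = groups.mean(dim=1, keepdim=True)          # group baseline
    std = groups.std(dim=1, keepdim=True)            # group spread
    return ((groups - mean) / (std + eps)).view(-1)  # above-average attempts get positive advantage

# Example: 2 tasks x 4 generations each
rewards = torch.tensor([0.0, 0.5, 1.0, 0.5, 0.2, 0.2, 0.8, 0.2])
print(group_relative_advantages(rewards, num_generations=4))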

Configuration

Training uses YAML configs to map agent tool calls to MCP tools:
# From rl/configs/2048.yaml
system_prompt: |
  You are an excellent 2048 player. 
  Available moves: left(), right(), up(), down(), done()
  Return ONLY: <tool>move()</tool>

action_mappings:
  left:
    _tool: "move"
    direction: {static: "left"}
  right:
    _tool: "move"  
    direction: {static: "right"}
  # ... up, down similar
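
With the mappings above, an agent reply like <tool>left()</tool> resolves to the environment's move MCP tool called with direction="left". A rough illustration of that resolution (the to_mcp_call helper below is hypothetical, not hud-vf-gym's actual parser):
import re

# Hypothetical sketch of applying an action mapping (names are illustrative)
action_mappings = {
    "left":  {"_tool": "move", "direction": {"static": "left"}},
    "right": {"_tool": "move", "direction": {"static": "right"}},
}

def to_mcp_call(agent_reply: str):
    """Parse '<tool>left()</tool>' and resolve it through the action mappings."""
    match = re.search(r"<tool>(\w+)\(\)</tool>", agent_reply)
    if match is None:
        raise ValueError(f"no tool call found in {agent_reply!r}")
    mapping = action_mappings[match.group(1)]
    arguments = {key: value["static"] for key, value in mapping.items() if key != "_tool"}
    return mapping["_tool"], arguments

print(to_mcp_call("<tool>left()</tool>"))  # -> ('move', {'direction': 'left'})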

Custom Training

import verifiers as vf

# Load environment
env = vf.load_environment(
    env_id="hud-vf-gym",
    taskset="your-org/your-taskset",  # HuggingFace dataset
    config_path="configs/your-env.yaml",  # In hud-vf-gym
    num_tasks=10
)

# Configure training (matching train_2048.py)
model, tokenizer = vf.get_model_and_tokenizer("Qwen/Qwen2.5-3B-Instruct")
args = vf.grpo_defaults(run_name="my-training")
args.gradient_accumulation_steps = 2
args.per_device_train_batch_size = 8
args.num_generations = 16
args.max_steps = 100
args.learning_rate = 1e-6

# Train with LoRA
trainer = vf.GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    env=env,
    args=args,
    peft_config=vf.lora_defaults()
)
trainer.train()
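
After training finishes, you will usually want to persist the LoRA adapter. Assuming GRPOTrainer follows the standard Hugging Face Trainer interface, something like this should work:
# Save the trained LoRA adapter and tokenizer (the output path is just an example)
trainer.save_model("outputs/my-training")
tokenizer.save_pretrained("outputs/my-training")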

Monitoring

HUD Dashboard

View training progress and traces at app.hud.so:
  • Real-time screenshot and step viewing
  • Tool call traces for each trajectory
  • Reward distribution across generations

Weights & Biases

For detailed ML metrics:
import os

os.environ["WANDB_API_KEY"] = "your-key"
args.report_to = "wandb"

Resources