Train agents to perform better on specific tasks using GRPO (Group Relative Policy Optimization).

Quick Start

Prerequisites

  • GPU with 24GB+ VRAM (A100, RTX 4090)
  • Python 3.12+
  • HUD API key (set HUD_API_KEY environment variable)

Installation

# Clone repository
git clone https://github.com/hud-evals/hud-python
cd hud-python/rl

# Clone hud-vf-gym (required)
git clone https://github.com/hud-evals/hud-vf-gym.git

# Install with uv (recommended)
uv sync
source .venv/bin/activate

Train 2048 Agent

1. Build Docker Image

# Set API key
export HUD_API_KEY="YOUR_API_KEY"

# Build the 2048 environment
cd ../environments/text_2048/
docker build -t hud-text-2048 .
cd ../../rl/

2. Start vLLM Server

# Terminal 1 - GPU 0
CUDA_VISIBLE_DEVICES=0 vf-vllm \
  --model Qwen/Qwen2.5-3B-Instruct \
  --enforce-eager \
  --disable-log-requests
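
Before launching training, you can confirm the server is up and serving the model. This assumes vf-vllm exposes the usual OpenAI-compatible API on the default port 8000; adjust the URL if your setup differs:
# Quick readiness check from another shell
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))  # should list Qwen/Qwen2.5-3B-Instruct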

3. Run Training

# Terminal 2 - GPU 1
CUDA_VISIBLE_DEVICES=1 python train_2048.py

The included train_2048.py script trains a 3B model on the 2048 game. Training takes ~30 minutes for 100 steps.

[Figure: RL training progress: Qwen2.5-3B agent training on the text-2048 environment using GRPO]

How It Works

GRPO trains by comparing multiple attempts at the same task: for each task, the trainer samples a group of completions, scores each one with the environment's reward, and reinforces the completions that score above the group average.
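
A minimal sketch of the group-relative advantage computation, for intuition only (this is not the verifiers internals, just the core idea):
import torch

def group_relative_advantages(rewards: torch.Tensor, num_generations: int, eps: float = 1e-4) -> torch.Tensor:
    """Normalize each reward against its own group of generations."""
    groups = rewards.view(-1, num_generations)       # one row per task
    mean = groups.mean(dim=1, keepdim=True)          # group baseline
    std = groups.std(dim=1, keepdim=True)            # group spread
    return ((groups - mean) / (std + eps)).view(-1)  # above-average attempts get positive advantage

# Example: 2 tasks x 4 generations each
rewards = torch.tensor([0.0, 0.5, 1.0, 0.5, 0.2, 0.2, 0.8, 0.2])
print(group_relative_advantages(rewards, num_generations=4))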

Configuration

Training uses YAML configs to map agent tool calls to MCP tools:
# From rl/configs/2048.yaml
system_prompt: |
  You are an excellent 2048 player. 
  Available moves: left(), right(), up(), down(), done()
  Return ONLY: <tool>move()</tool>

action_mappings:
  left:
    _tool: "move"
    direction: {static: "left"}
  right:
    _tool: "move"  
    direction: {static: "right"}
  # ... up, down similar
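
With the mappings above, an agent reply like <tool>left()</tool> resolves to the environment's move MCP tool called with direction="left". A rough illustration of that resolution (the to_mcp_call helper below is hypothetical, not hud-vf-gym's actual parser):
import re

# Hypothetical sketch of applying an action mapping (names are illustrative)
action_mappings = {
    "left":  {"_tool": "move", "direction": {"static": "left"}},
    "right": {"_tool": "move", "direction": {"static": "right"}},
}

def to_mcp_call(agent_reply: str):
    """Parse '<tool>left()</tool>' and resolve it through the action mappings."""
    match = re.search(r"<tool>(\w+)\(\)</tool>", agent_reply)
    if match is None:
        raise ValueError(f"no tool call found in {agent_reply!r}")
    mapping = action_mappings[match.group(1)]
    arguments = {key: value["static"] for key, value in mapping.items() if key != "_tool"}
    return mapping["_tool"], arguments

print(to_mcp_call("<tool>left()</tool>"))  # -> ('move', {'direction': 'left'})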

Custom Training

import verifiers as vf

# Load environment
env = vf.load_environment(
    env_id="hud-vf-gym",
    taskset="your-org/your-taskset",  # HuggingFace dataset
    config_path="configs/your-env.yaml",  # In hud-vf-gym
    num_tasks=10
)

# Configure training (matching train_2048.py)
model, tokenizer = vf.get_model_and_tokenizer("Qwen/Qwen2.5-3B-Instruct")
args = vf.grpo_defaults(run_name="my-training")
args.gradient_accumulation_steps = 2
args.per_device_train_batch_size = 8
args.num_generations = 16
args.max_steps = 100
args.learning_rate = 1e-6

# Train with LoRA
trainer = vf.GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    env=env,
    args=args,
    peft_config=vf.lora_defaults()
)
trainer.train()
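
After training finishes, you will usually want to persist the LoRA adapter. Assuming GRPOTrainer follows the standard Hugging Face Trainer interface, something like this should work:
# Save the trained LoRA adapter and tokenizer (the output path is just an example)
trainer.save_model("outputs/my-training")
tokenizer.save_pretrained("outputs/my-training")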

Monitoring

HUD Dashboard

View training progress and traces at app.hud.so:
  • Real-time screenshot and step viewing
  • Tool call traces for each trajectory
  • Reward distribution across generations

Weights & Biases

For detailed ML metrics:
import os

os.environ["WANDB_API_KEY"] = "your-key"
args.report_to = "wandb"

Resources