Environments
Build, test, and share custom Docker environments for HUD
Environment Creation & Contribution Guide
This guide walks you through creating custom Web App or Docker-based environments for the HUD SDK, testing them, and contributing them to the community.
Custom environments allow you to evaluate agents in highly specific scenarios, such as:
- Your own web applications or internal tools.
- Desktop applications with GUI interaction (e.g., via VNC).
- Specialized software stacks or game environments.
- Any scenario requiring a controlled, reproducible Docker setup.
1. Understanding Custom Environments
Custom environments in HUD can be written in two ways:
- Simple Web App Hosting: You can produce a static website or a simple web application and have hud-browser launch with setup=("goto", "yourwebsite.com") and evaluate=("page_contains", "your_success_criteria").
- Controller-Based Environments: For more complex interactions or non-web environments, you can create a hud_controller Python package within a Docker image. This controller exposes custom setup and evaluate functions that HUD can call.
Refer to the browser environment guide for more on how to use the hud-browser environment.
This tutorial will focus on the second option.
2. Designing Your Custom Environment
Before you start coding:
- Define the Purpose: What specific agent capabilities will this environment test?
- Identify Key Interactions: What are the core setup steps and success criteria?
- Choose Technology: Will it be a web app (see browser environment), a desktop app (via VNC), or something else? This influences your Docker base image.
3. Building a Controller-Based Environment
This is for environments where you need custom Python logic running inside the Docker container for setup and evaluation.
Directory Structure
Organize your environment in a dedicated folder, for example environments/my_new_env/, containing the files described below:
a. Dockerfile
Your Dockerfile sets up the base system, installs dependencies, and copies your controller code.
b. pyproject.toml (for hud_controller)
This makes your controller code an installable Python package named hud_controller.
c. src/hud_controller/main.py (Controller Logic)
Implement Python functions that can be called by Task.setup and Task.evaluate.
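As a rough illustration of what such controller functions might look like, here is a minimal sketch. The function names (write_state, state_contains), the state file path, and the dictionary return shapes are all hypothetical; the guide does not specify what HUD expects a controller function to return, so check the SDK before relying on any of this.

```python
# src/hud_controller/main.py (illustrative sketch; names and return shapes are assumptions)
from pathlib import Path

# Hypothetical location for environment state inside the container.
STATE_FILE = Path("/tmp/my_env_state.txt")


def write_state(text: str, path: Path = STATE_FILE) -> dict:
    """Setup step: put the environment into a known starting state."""
    path.write_text(text)
    return {"status": "ok", "path": str(path)}


def state_contains(expected: str, path: Path = STATE_FILE) -> dict:
    """Evaluation step: report whether the expected text is present."""
    content = path.read_text() if path.exists() else ""
    return {"success": expected in content}
```

Assuming controller functions are referenced with the same (function_name, argument) tuple form shown earlier for hud-browser, a task might use setup=("write_state", "hello") and evaluate=("state_contains", "hello").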
d. src/hud_controller/__init__.py
4. Building & Testing Locally
a. Define CustomGym
In your local Python script or notebook:
b. Create a Test Task
c. Run and Debug
Local to Remote: Once your environment is working well locally, you can easily transition it to remote execution for scaled testing and wider accessibility. See the “Scaling Up: Remote Execution & Parallel Testing” section below for details on pushing your image to a registry and updating your CustomGym configuration.
Hot Reloading (for Controller-Based Environments)
HUD supports hot reloading for your hud_controller code:
- Make changes to your Python files in src/hud_controller/.
- The next time gym.make(your_custom_gym_with_build_context) is called, HUD will detect changes and reinstall your hud_controller package inside the running container (if one is reused) or in a new container, without a full Docker image rebuild for controller code changes.
  - Note: Changes to Dockerfile or pyproject.toml dependencies will still require an image rebuild.
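The guide does not say how HUD detects changes to the controller code. One plausible approach, shown purely as an illustration and not as HUD's actual mechanism, is to fingerprint the source directory and compare it with the fingerprint recorded at the last install. The helper names fingerprint and needs_reinstall are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Optional


def fingerprint(source_dir: Path) -> str:
    """Return a SHA-256 digest over the relative paths and contents of all .py files."""
    digest = hashlib.sha256()
    for file in sorted(source_dir.rglob("*.py")):
        digest.update(file.relative_to(source_dir).as_posix().encode())
        digest.update(file.read_bytes())
    return digest.hexdigest()


def needs_reinstall(source_dir: Path, last_fingerprint: Optional[str]) -> bool:
    """True when the controller source differs from the last installed version."""
    return fingerprint(source_dir) != last_fingerprint
```

A content hash is used here rather than file modification times because it is unaffected by checkouts or copies that touch timestamps without changing the code.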
5. Scaling Up: Remote Execution & Parallel Testing
Once you’ve developed and tested your custom environment locally (defined by a Dockerfile in a local directory), HUD makes it incredibly easy to run it on our remote infrastructure for scaled evaluations and parallel testing across multiple tasks.
a. Seamless Local-to-Remote Execution with Build Context Upload
If your environment is defined by a local build context (a directory with a Dockerfile), you do not need to manually build the image, push it to a Docker registry, and then reference it for remote execution.
How it Works:
When you define a CustomGym with location="remote" and provide a local Path to image_or_build_context, the HUD SDK will automatically:
- Package your local build context: It zips the contents of the directory specified in image_or_build_context.
- Upload to HUD: Securely uploads this zipped build context to the HUD platform.
- Remote Docker Build: The HUD platform builds your Docker image from the uploaded context in its own isolated environment.
- Remote Execution: The platform then runs this newly built image to create the environment instance for your task.
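The packaging step in the list above can be sketched with the standard library. This is only an approximation of the idea: the function name zip_build_context is hypothetical, and the SDK's real packaging (for example, whether it honors .dockerignore) may differ.

```python
import io
import zipfile
from pathlib import Path


def zip_build_context(context_dir: Path) -> bytes:
    """Zip every file under context_dir, storing paths relative to it."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in sorted(context_dir.rglob("*")):
            if file.is_file():
                # Relative archive names keep the Dockerfile at the archive root.
                archive.write(file, file.relative_to(context_dir).as_posix())
    return buffer.getvalue()
```

Because everything in the directory ends up in the archive, keeping the build context small and self-contained directly reduces upload time.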
This streamlined process means you can iterate locally, and when ready for scaled runs or sharing (without requiring local Docker on other machines), simply ensure location="remote" when using a local Path as the build context. The HUD platform handles the build and execution.
b. Using Pre-Pushed Images (Alternative for Remote Execution)
If you prefer to manage your own Docker image registry or have images already built and pushed (e.g., to Docker Hub, GCR, ECR), you can directly specify the image name for remote execution:
In this case, the HUD platform will pull and run the specified pre-built image from the registry.
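To contrast the two remote options described in this section, the sketch below uses a hypothetical stand-in class, GymSpec, that carries only the two fields the guide names (location and image_or_build_context). It is not the SDK's CustomGym, and the image name is a placeholder. The point is the type distinction: a Path is treated as a build context, a string as an image name.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Union


@dataclass
class GymSpec:
    """Hypothetical stand-in for CustomGym with only the fields the guide names."""
    location: str
    image_or_build_context: Union[str, Path]


def remote_plan(spec: GymSpec) -> str:
    """Describe which remote path a spec would take, per the guide's description."""
    if spec.location != "remote":
        return "not a remote spec"
    if isinstance(spec.image_or_build_context, Path):
        return "upload build context, then build and run remotely"
    return "pull pre-built image, then run remotely"


from_context = GymSpec("remote", Path("./environments/my_new_env"))
from_registry = GymSpec("remote", "yourusername/my-env:latest")  # placeholder image name
```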
c. Parallel Task Execution on the HUD Platform
When you run a TaskSet using run_job() and your CustomGym is configured for location="remote" (either via automatic build context upload or a pre-pushed image), the HUD platform automatically parallelizes the execution of tasks across multiple instances of your custom environment. This significantly speeds up large-scale evaluations.
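With run_job() the platform handles the parallelism for you. To make the idea concrete, the following sketch shows bounded concurrent execution with asyncio. It is not HUD's implementation: run_task is a stand-in that simulates one environment episode, and the max_concurrent parameter is an assumption for illustration.

```python
import asyncio
from typing import List


async def run_task(task_id: int) -> dict:
    """Stand-in for one episode; a real run would create the environment and evaluate."""
    await asyncio.sleep(0.01)  # simulate remote work
    return {"task": task_id, "reward": 1.0}


async def run_all(task_ids: List[int], max_concurrent: int = 4) -> List[dict]:
    """Run tasks concurrently, capping how many are in flight at once."""
    limit = asyncio.Semaphore(max_concurrent)

    async def guarded(task_id: int) -> dict:
        async with limit:
            return await run_task(task_id)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(t) for t in task_ids))
```

Calling asyncio.run(run_all([1, 2, 3])) returns one result dictionary per task, in input order.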
Benefits of HUD’s Remote Execution:
- Simplified Workflow: Easy transition from local Dockerfile-based development to remote execution.
- Automatic Build & Deploy (for local contexts): HUD handles the build context upload and remote image creation for you when a local path is provided for remote execution.
- Scalability: Run many tasks in parallel without consuming local resources.
- Consistency: Ensures all evaluations use the same controlled cloud build and runtime.
- Accessibility: Team members or CI systems can run evaluations without needing local Docker setups.
6. Contributing Your Environment
We strongly encourage community contributions! If you’ve built a useful custom environment, share it with others.
When contributing, consider making your image publicly available on Docker Hub if it’s a pre-built image. If it’s defined by a build context, ensure the Dockerfile and any necessary files are self-contained within its directory for others to build locally or for HUD to build remotely.
Contribution Workflow:
- Fork & Branch: Fork the HUD SDK GitHub repository and create a new branch for your environment.
- Directory Structure: Place your environment code within the environments/ directory. Follow the structure outlined above (e.g., environments/your_env_name/).
- README.md: Create a README.md inside your environment’s folder (environments/your_env_name/README.md). This should explain:
  - What your environment does and its purpose.
  - Any prerequisites (e.g., specific software, API keys needed inside the env).
  - How to build/run it locally.
  - Example Task definitions demonstrating its use with setup and evaluate.
- Dockerfile Best Practices: Keep your Dockerfile clean, commented, and use minimal base images where possible.
- Controller (if applicable): Ensure your hud_controller is well-structured and functions are documented.
- Test Thoroughly:
  - Test image building.
  - Test gym.make() with your CustomGym spec.
  - Verify Task.setup and Task.evaluate functions work as expected with your controller.
- Submit a PR:
  - Target the main branch of the hud-evals/hud-sdk repository.
  - Provide a clear description of your environment and its use cases in the PR.
  - Link to any relevant issues.
- Code Review: The HUD team will review your contribution for clarity, functionality, and adherence to guidelines.
Example Community Environments:
Check the environments/ directory in the SDK for inspiration:
- environments/novnc_ubuntu/: Provides an Ubuntu desktop accessible via VNC, for GUI-based tasks.
- environments/custom_website/: A template for packaging and testing your own web application.
- environments/pokemon_controller/: Example of a retro gaming environment.
Using Remote Custom Environments
This section applies to images that are already in a registry or are built remotely by HUD from your local context.
If your Docker image is publicly accessible (e.g., on Docker Hub or another registry), or if you are using a local build context with location="remote", configure your CustomGym as described in “Scaling Up: Remote Execution & Parallel Testing” above.
Tips for Robust Environments
- Health Checks: If your Docker image starts services, implement a health check in your Dockerfile or an initial setup step to ensure they are ready before the agent interacts.
- Deterministic Behavior: Aim for environments that reset to the same state each time for reproducible evaluations.
- Minimize Image Size: Use multi-stage builds or .dockerignore to keep image sizes manageable.
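For the health check tip, one option is an initial setup step that waits for a service port to accept connections before the agent starts. The sketch below uses only the standard library; the function name wait_for_port and its defaults are illustrative, and the right host and port depend on what your image starts.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Service not accepting connections yet; retry after a short pause.
            time.sleep(interval)
    return False
```

A successful TCP connection only shows that something is listening. For services with a slow application-level startup, a request against a known endpoint is a stronger readiness signal.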
Related Guides
- Task Creation: How to define tasks that use your custom environments.
- Custom Environments Overview: Higher-level concepts of custom environments.
- Browser Environment: For standard web interaction tasks.