Commit 9f1c0bd

Merge pull request #146 from AI-Hypercomputer/ajkv/docker-fix
Updated dockerfile instructions to run DLRM
2 parents 4317d1c + 532eb7b commit 9f1c0bd

2 files changed

Lines changed: 54 additions & 3 deletions

Dockerfile

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# Use an official Python 3.11 runtime as a parent image
+FROM python:3.11-slim
+
+# Set the working directory
+WORKDIR /app
+
+# Copy the current directory contents into the container
+COPY . /app
+
+# This tells Python to look in /app for the 'recml' package
+ENV PYTHONPATH="${PYTHONPATH}:/app"
+
+# Install system tools if needed (e.g., git)
+RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
+
+# Install dependencies
+RUN pip install --upgrade pip
+RUN pip install -r requirements.txt
+
+# Force install the specific protobuf version
+RUN pip install "protobuf>=6.31.1" --no-deps
+
+# Default command to run the training script
+CMD ["python", "recml/examples/dlrm_experiment_test.py"]
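
The `ENV PYTHONPATH="${PYTHONPATH}:/app"` line appends `/app` to whatever value `PYTHONPATH` already holds. The sketch below only mimics that string expansion in a POSIX shell; it does not run Docker. It assumes that `PYTHONPATH` may be unset in the base image (not confirmed by this commit), and the function names `append_plain` and `append_guarded` are hypothetical, chosen for illustration.

```bash
# Mirrors the Dockerfile's ENV line: always join with ':' even if the
# current value is empty.
append_plain() {
  current="$1"
  printf '%s\n' "${current}:/app"
}

# Alternative: emit the existing value and the ':' separator only when
# the current value is non-empty.
append_guarded() {
  current="$1"
  printf '%s\n' "${current:+${current}:}/app"
}

append_plain ""            # prints ":/app"
append_plain "/opt/lib"    # prints "/opt/lib:/app"
append_guarded ""          # prints "/app"
append_guarded "/opt/lib"  # prints "/opt/lib:/app"
```

With an empty starting value, the plain form leaves a leading empty entry in the path. Because `WORKDIR` is already `/app`, this is unlikely to matter here. The Dockerfile reference documents the `${variable:+word}` form for `ENV`, so the guarded variant could be used there if a clean value is preferred.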

training.md

Lines changed: 30 additions & 3 deletions
@@ -6,7 +6,7 @@ This guide explains how to set up the environment and train the HSTU/DLRM models
 
 If you are developing on a TPU VM directly, use a virtual environment to avoid conflicts with the system-level Python packages.
 
-#### 1. Prerequisites
+### 1. Prerequisites
 Ensure you have **Python 3.11+** installed.
 ```bash
 python3 --version
@@ -41,7 +41,7 @@ python dlrm_experiment_test.py
 
 If you prefer not to manage a virtual environment or want to deploy this as a container, you can build a Docker image.
 
-## 1. Build the Image
+### 1. Create a Dockerfile
 Create a file named `Dockerfile` in the root of the repository:
 
 ```dockerfile
@@ -54,6 +54,9 @@ WORKDIR /app
 # Copy the current directory contents into the container
 COPY . /app
 
+# This tells Python to look in /app for the 'recml' package
+ENV PYTHONPATH="${PYTHONPATH}:/app"
+
 # Install system tools if needed (e.g., git)
 RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
 
@@ -68,4 +71,28 @@ RUN pip install "protobuf>=6.31.1" --no-deps
 CMD ["python", "recml/examples/dlrm_experiment_test.py"]
 ```
 
-You can use this dockerfile to run the DLRM model experiment from this repo in your own environment.
+You can use this dockerfile to run the DLRM model experiment from this repo in your own environment.
+
+### 2. Build the Image
+
+Run this command from the root of the repository. It reads the `Dockerfile`, installs all dependencies, and creates a ready-to-run image.
+
+```bash
+docker build -t recml-training .
+```
+
+### 3. Run the Image
+
+```bash
+docker run --rm --privileged \
+--net=host \
+--ipc=host \
+--name recml-experiment \
+recml-training
+```
+
+### What is happening here?
+* **`--rm`**: Automatically deletes the container after the script finishes to keep your disk clean.
+* **`--privileged`**: Grants the container direct access to the host's hardware devices, which is required to see the physical TPU chips.
+* **`--net=host`**: Removes the container's network isolation, allowing the script to connect to the TPU runtime listening on local ports (e.g., 8353).
+* **`--ipc=host`**: Allows the container to use the host's Shared Memory (IPC), which is critical for high-speed data transfer between the CPU and TPU.
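
The flags listed above can be collected in a small dry-run wrapper, which may help when reviewing what will be granted to the container before launching it. This is a minimal sketch, not part of the commit: the function name `build_run_cmd` and the `IMAGE` and `CONTAINER_NAME` variables are hypothetical, with defaults taken from the `docker build` and `docker run` commands in the guide. It only prints the command and does not require Docker to be installed.

```bash
# Defaults match the tag and container name used in the guide; both can be
# overridden from the environment.
IMAGE="${IMAGE:-recml-training}"
CONTAINER_NAME="${CONTAINER_NAME:-recml-experiment}"

# Print the full `docker run` invocation on one line instead of executing it.
build_run_cmd() {
  printf 'docker run --rm --privileged --net=host --ipc=host --name %s %s\n' \
    "$CONTAINER_NAME" "$IMAGE"
}

build_run_cmd
```

To launch the container for real, assuming Docker is available on the TPU VM, the printed line can be copied into the terminal as is.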
