@@ -6,7 +6,7 @@ This guide explains how to set up the environment and train the HSTU/DLRM models
 
 If you are developing on a TPU VM directly, use a virtual environment to avoid conflicts with the system-level Python packages.
 
-#### 1. Prerequisites
+### 1. Prerequisites
 Ensure you have **Python 3.11+** installed.
 ```bash
 python3 --version
@@ -41,7 +41,7 @@ python dlrm_experiment_test.py
 
 If you prefer not to manage a virtual environment or want to deploy this as a container, you can build a Docker image.
 
-## 1. Build the Image
+### 1. Create a Dockerfile
 Create a file named `Dockerfile` in the root of the repository:
 
 ```dockerfile
@@ -54,6 +54,9 @@ WORKDIR /app
 # Copy the current directory contents into the container
 COPY . /app
 
+# This tells Python to look in /app for the 'recml' package
+ENV PYTHONPATH="${PYTHONPATH}:/app"
+
 # Install system tools if needed (e.g., git)
 RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
 
@@ -68,4 +71,28 @@ RUN pip install "protobuf>=6.31.1" --no-deps
 CMD ["python", "recml/examples/dlrm_experiment_test.py"]
 ```
 
-You can use this dockerfile to run the DLRM model experiment from this repo in your own environment.
+You can use this dockerfile to run the DLRM model experiment from this repo in your own environment.
+
+### 2. Build the Image
+
+Run this command from the root of the repository. It reads the `Dockerfile`, installs all dependencies, and creates a ready-to-run image.
+
+```bash
+docker build -t recml-training .
+```
+
+### 3. Run the Image
+
+```bash
+docker run --rm --privileged \
+  --net=host \
+  --ipc=host \
+  --name recml-experiment \
+  recml-training
+```
+
+### What is happening here?
+* **`--rm`**: Automatically deletes the container after the script finishes to keep your disk clean.
+* **`--privileged`**: Grants the container direct access to the host's hardware devices, which is required to see the physical TPU chips.
+* **`--net=host`**: Removes the container's network isolation, allowing the script to connect to the TPU runtime listening on local ports (e.g., 8353).
+* **`--ipc=host`**: Allows the container to use the host's Shared Memory (IPC), which is critical for high-speed data transfer between the CPU and TPU.
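
One way to sanity-check the `PYTHONPATH` line this change adds to the Dockerfile is to ask Python whether it can locate the package before starting a long experiment. The sketch below is not part of the commit. It assumes the top-level package is named `recml`, as the Dockerfile comment states, and the helper name `is_importable` is hypothetical.

```python
import importlib.util


def is_importable(package_name: str) -> bool:
    """Return True if a top-level package can be located on the current sys.path.

    find_spec only searches for the module; it does not import or execute it.
    """
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # 'recml' is the package name taken from the Dockerfile comment (assumption).
    name = "recml"
    if is_importable(name):
        print(f"OK: '{name}' can be found on the Python path")
    else:
        print(f"NOT FOUND: '{name}' is not on the Python path; check PYTHONPATH")
```

If this were saved in the repository root under a hypothetical name such as `check_import.py`, it would be copied into `/app` by the `COPY . /app` step and could be run in place of the default command with `docker run --rm recml-training python check_import.py`.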