feat: add cluster setup files

cchadj · cchadj · commit 1b0838c5207c · 2026-01-22T13:10:25.000+02:00
diff --git a/.gitignore b/.gitignore
@@ -1 +1,2 @@
-__pycache__
+__pycache__
+*.ply
diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1,70 @@
+# AGENTS
+
+Quick orientation and cluster-specific setup for this `sam-3d-objects` fork.
+
+## Repo overview
+- Model: SAM 3D Objects (single image -> 3D geometry/texture/layout).
+- Primary docs: `README.md`, `doc/setup.md`, `SAM3D_SETUP_NOTES.md`.
+- Cluster helpers live in `repro/` (scripts for reproducible runs on this cluster).
+
+## Cluster requirements
+- Linux platform `linux-64`.
+- NVIDIA GPU with >= 32 GB VRAM (A6000 preferred).
+- Build/install on a GPU node to avoid PyTorch3D CPU-only builds.
+
+## Recommended Slurm allocation
+```
+salloc -p a6000 --gres=gpu:1 --cpus-per-task=8 --mem=32G --time=02:00:00
+srun --pty bash
+```
+
+## Environment setup (mamba)
+```
+cd /path/to/sam-3d-objects
+
+mamba env create -f environments/default.yml
+mamba activate sam3d-objects
+
+export PIP_EXTRA_INDEX_URL="https://pypi.ngc.nvidia.com https://download.pytorch.org/whl/cu121"
+pip install -e '.[dev]'
+pip install -e '.[p3d]'
+
+export PIP_FIND_LINKS="https://nvidia-kaolin.s3.us-east-2.amazonaws.com/torch-2.5.1_cu121.html"
+pip install -e '.[inference]'
+
+./patching/hydra
+```
+
+## Hugging Face checkpoints
+Access is required for `facebook/sam-3d-objects`.
+```
+pip install 'huggingface-hub[cli]<1.0'
+hf auth login
+
+TAG=hf
+hf download \
+  --repo-type model \
+  --local-dir checkpoints/${TAG}-download \
+  --max-workers 1 \
+  facebook/sam-3d-objects
+mv checkpoints/${TAG}-download/checkpoints checkpoints/${TAG}
+rm -rf checkpoints/${TAG}-download
+```
+
+## Sanity checks
+```
+nvidia-smi
+mamba info | rg "platform|platforms"
+
+python - <<'PY'
+import torch
+print("cuda:", torch.cuda.is_available())
+if torch.cuda.is_available():
+    print(torch.cuda.get_device_name(0))
+PY
+```
+
+## Quick run
+```
+python demo.py
+```
diff --git a/README.md b/README.md
@@ -29,6 +29,40 @@ SAM 3D Objects is one part of SAM 3D, a pair of models for object and human mesh
 
 Follow the [setup](doc/setup.md) steps before running the following.
 
+## Slurm quickstart (cluster navigation)
+
+This project is often run on a Slurm cluster. Here are the core concepts and the most common commands.
+
+**Concepts**
+- Controller: the node running the Slurm scheduler daemon (`slurmctld`); Slurm commands (`sinfo`, `squeue`) are typically run from a login node, which may be the same machine.
+- Node: a compute machine (e.g. `gpu01`); jobs run here.
+- Partition: a queue of nodes with shared policies (e.g. `defq`, `a6000`).
+- Job/step: a scheduled unit of work (`sbatch` for batch jobs, `srun` for steps).
+- GRES/TRES: generic resources such as GPUs (requested with `--gres=gpu:1`) and trackable resources (CPU, memory, `gres/gpu`) used for accounting and limits.
+
+**Find resources**
+- Nodes and state: `sinfo -N -l`
+- Node details (GPUs/CPU/RAM): `scontrol show node gpu01`
+- Your jobs: `squeue -u $USER`
+- Watch your queue: `watch -n 2 "squeue -u $USER -o '%.18i %.9P %.20j %.8T %.10M %.6D %R'"`
+
+**Run work**
+- Interactive shell on a node: `srun -N 1 -n 1 -c 4 --mem=16G --pty bash`
+- Run a command on a specific node: `srun -w gpu01 hostname`
+- Request GPUs (required for `nvidia-smi` to see devices):
+  `srun -w gpu01 --gres=gpu:1 nvidia-smi -L`
+- Batch job (script):
+  `sbatch path/to/job.sh`
+
+**Control jobs**
+- Cancel job: `scancel <jobid>`
+- Inspect job: `scontrol show job <jobid>`
+
+**Resource flags (common)**
+- CPUs: `-c 8` or `--cpus-per-task=8`
+- Memory: `--mem=64G` or `--mem-per-cpu=4G`
+- GPUs: `--gres=gpu:1` (or `--gpus-per-task=1` if configured)
+
 ## Single or Multi-Object 3D Generation
 
 SAM 3D Objects can convert masked objects in an image, into 3D models with pose, shape, texture, and layout. SAM 3D is designed to be robust in challenging natural images, handling small objects and occlusions, unusual poses, and difficult situations encountered in uncurated natural scenes like this kidsroom:
diff --git a/SAM3D_SETUP_NOTES.md b/SAM3D_SETUP_NOTES.md
@@ -0,0 +1,93 @@
+# SAM 3D Objects - Cluster Setup Notes
+
+This document captures the findings and a complete setup flow for `sam-3d-objects` on this Slurm cluster using mamba.
+
+## Repository Location
+
+- Repo path: `$REPO_ROOT`
+
+## Prerequisites (from `doc/setup.md`)
+
+- Linux 64-bit (mamba platform `linux-64`).
+- NVIDIA GPU with at least 32 GB VRAM.
+- Build on a GPU node to avoid PyTorch3D "Not compiled with GPU support" errors.
+
+## Slurm Findings
+
+Partitions observed:
+
+- `defq` (nodes `gpu01-08`)
+- `a6000` (node `gpu09`)
+
+GPU resources for `a6000`:
+
+- `gpu09` has `gres=gpu:4` and is in partition `a6000`.
+- Use this partition to satisfy the >= 32 GB VRAM requirement (the RTX A6000 has 48 GB).
+
+## Recommended Interactive Allocation
+
+```
+salloc -p a6000 --gres=gpu:1 --cpus-per-task=8 --mem=32G --time=02:00:00
+srun --pty bash
+```
+
+## Environment Setup (mamba)
+
+From `doc/setup.md`:
+
+```
+cd $REPO_ROOT
+
+mamba env create -f environments/default.yml
+mamba activate sam3d-objects
+
+export PIP_EXTRA_INDEX_URL="https://pypi.ngc.nvidia.com https://download.pytorch.org/whl/cu121"
+pip install -e '.[dev]'
+pip install -e '.[p3d]'
+
+export PIP_FIND_LINKS="https://nvidia-kaolin.s3.us-east-2.amazonaws.com/torch-2.5.1_cu121.html"
+pip install -e '.[inference]'
+
+./patching/hydra
+```
+
+## GPU and Platform Verification
+
+```
+nvidia-smi
+mamba info | rg "platform|platforms"
+```
+
+Expected:
+
+- GPU present and visible in `nvidia-smi`.
+- `platform : linux-64` in `mamba info`.
+
+## Hugging Face Checkpoints
+
+Access required for `facebook/sam-3d-objects`.
+
+```
+pip install 'huggingface-hub[cli]<1.0'
+hf auth login
+
+TAG=hf
+hf download \
+  --repo-type model \
+  --local-dir checkpoints/${TAG}-download \
+  --max-workers 1 \
+  facebook/sam-3d-objects
+mv checkpoints/${TAG}-download/checkpoints checkpoints/${TAG}
+rm -rf checkpoints/${TAG}-download
+```
+
+## Sanity Check (CUDA)
+
+```
+python - <<'PY'
+import torch
+print("cuda:", torch.cuda.is_available())
+if torch.cuda.is_available():
+    print(torch.cuda.get_device_name(0))
+PY
+```
diff --git a/download_model.py b/download_model.py
@@ -0,0 +1,3 @@
+from huggingface_hub import hf_hub_download
+
+path = hf_hub_download("facebook/sam-3d-objects", "pipeline.yaml")
diff --git a/repro/capture_state.sh b/repro/capture_state.sh
@@ -0,0 +1,28 @@
+#!/usr/bin/env bash
+set -euo pipefail
+source "$(dirname "$0")/env.sh"
+
+mkdir -p "${REPO}/repro/state"
+
+# repo revision
+git -C "${REPO}" rev-parse HEAD > "${REPO}/repro/state/git_commit.txt"
+git -C "${REPO}" status --porcelain > "${REPO}/repro/state/git_dirty.txt" || true
+
+# container fingerprint
+sha256sum "${SIF}" > "${REPO}/repro/state/container.sha256"
+apptainer inspect "${SIF}" > "${REPO}/repro/state/container.inspect.txt" || true
+
+# environment package locks
+"$(dirname "$0")/container_exec.sh" "
+  ENV_PREFIX=\$(micromamba run -n sam3d-objects python -c 'import sys; print(sys.prefix)')
+  echo \"ENV_PREFIX=\$ENV_PREFIX\" > repro/state/env_prefix.txt
+
+  micromamba list -n sam3d-objects > repro/state/micromamba_list.txt
+  micromamba list -n sam3d-objects --explicit > repro/state/micromamba_explicit.txt
+
+  micromamba run -n sam3d-objects python -m pip freeze > repro/state/pip_freeze.txt
+
+  nvidia-smi > repro/state/nvidia-smi.txt || true
+  nvcc --version > repro/state/nvcc.txt || true
+  ldd --version | head -n 1 > repro/state/glibc.txt || true
+"
diff --git a/repro/container_exec.sh b/repro/container_exec.sh
@@ -0,0 +1,40 @@
+#!/usr/bin/env bash
+set -euo pipefail
+source "$(dirname "$0")/env.sh"
+
+# command to run inside container
+CMD="${*:-bash}"
+
+apptainer exec --nv --cleanenv \
+  --bind "${BASE}:${BASE}" \
+  --bind "${SCRATCH}:${SCRATCH}" \
+  "${SIF}" bash -lc "
+    set -euo pipefail
+
+    export HOME=\"\${HOME}\"
+
+    export SCRATCH='${SCRATCH}'
+    export XDG_CACHE_HOME='${XDG_CACHE_HOME}'
+    export HF_HOME='${HF_HOME}'
+    export TORCH_HOME='${TORCH_HOME}'
+    export TMPDIR='${TMPDIR}'
+
+    export MAMBA_ROOT_PREFIX='${MAMBA_ROOT_PREFIX}'
+    export MAMBA_PKGS_DIRS='${MAMBA_PKGS_DIRS}'
+    export CONDA_PKGS_DIRS='${CONDA_PKGS_DIRS}'
+
+    export PATH='${SCRATCH}/bin':/usr/local/cuda/bin:\$PATH
+    export CUDA_HOME=/usr/local/cuda
+    export CUDACXX=/usr/local/cuda/bin/nvcc
+    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:\${LD_LIBRARY_PATH:-}
+
+    # Make conda env libs visible at runtime (critical for open3d/kaolin/etc.)
+    ENV_PREFIX=\$(micromamba run -n sam3d-objects python -c 'import sys; print(sys.prefix)')
+    export LD_LIBRARY_PATH=\"\$ENV_PREFIX/lib:\$LD_LIBRARY_PATH\"
+
+    export TORCH_CUDA_ARCH_LIST='${TORCH_CUDA_ARCH_LIST}'
+    export SAM3D_HF_DIR='${SAM3D_HF_DIR}'
+
+    cd '${REPO}'
+    ${CMD}
+  "
diff --git a/repro/env.sh b/repro/env.sh
@@ -0,0 +1,33 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# ---- site-specific paths ----
+export BASE="${HOME}/data/${USER}"
+export REPO="${BASE}/projects/sam-3d-objects"
+
+export SCRATCH_ROOT="/path/to/scratch"
+export SCRATCH="${SCRATCH_ROOT}/${USER}"
+export SIF="${SCRATCH}/containers/cuda121-ubuntu22.sif"
+
+# micromamba root + packages on Lustre (avoid /trinity/home caches)
+export MAMBA_ROOT_PREFIX="${SCRATCH}/micromamba"
+export MAMBA_PKGS_DIRS="${SCRATCH}/micromamba/pkgs"
+export CONDA_PKGS_DIRS="${SCRATCH}/micromamba/pkgs"
+
+# caches on Lustre
+export XDG_CACHE_HOME="${SCRATCH}/cache"
+export HF_HOME="${SCRATCH}/cache/huggingface"
+export TORCH_HOME="${SCRATCH}/cache/torch"
+export SAM3D_HF_DIR="${SCRATCH}/sam3d-hf"
+
+# optional: keep pip temp on Lustre too
+export TMPDIR="${SCRATCH}/tmp"
+
+# CUDA build target: compute capability 8.6 (RTX A5000 and RTX A6000)
+export TORCH_CUDA_ARCH_LIST="8.6+PTX"
+
+mkdir -p \
+  "${SCRATCH}/containers" \
+  "${MAMBA_ROOT_PREFIX}" "${MAMBA_PKGS_DIRS}" \
+  "${XDG_CACHE_HOME}" "${HF_HOME}" "${TORCH_HOME}" \
+  "${SAM3D_HF_DIR}" "${TMPDIR}"
diff --git a/repro/install_deps.sh b/repro/install_deps.sh
@@ -0,0 +1,27 @@
+#!/usr/bin/env bash
+set -euo pipefail
+source "$(dirname "$0")/env.sh"
+
+"$(dirname "$0")/container_exec.sh" "
+  # keep pip stable + avoid packaging 25 issues
+  micromamba run -n sam3d-objects python -m pip install -U 'pip==24.3.1' 'setuptools' 'wheel' 'packaging<25'
+
+  # build backend used by the repo
+  micromamba run -n sam3d-objects python -m pip install -U hatchling hatch-requirements-txt editables
+
+  # git needed for git+https deps
+  micromamba install -y -n sam3d-objects -c conda-forge git
+
+  # runtime libs for open3d
+  micromamba install -y -n sam3d-objects -c conda-forge \
+    xorg-libx11 xorg-libxext xorg-libxrender xorg-libxi xorg-libxfixes xorg-libxrandr \
+    libgl libegl libglu mesalib libcxx libcxxabi
+
+  # ensure the open3d extension is executable (avoids the ldd execute-permission warning)
+  ENV_PREFIX=\$(micromamba run -n sam3d-objects python -c 'import sys; print(sys.prefix)')
+  chmod a+rx \"\$ENV_PREFIX/lib/python3.11/site-packages/open3d/cpu/\"pybind*.so || true
+
+  # install project + inference extras (build gsplat against container nvcc)
+  micromamba run -n sam3d-objects python -m pip uninstall -y gsplat || true
+  micromamba run -n sam3d-objects python -m pip install -v --no-build-isolation -e '.[inference]'
+"
diff --git a/repro/pull_container.sh b/repro/pull_container.sh
@@ -0,0 +1,9 @@
+#!/usr/bin/env bash
+set -euo pipefail
+source "$(dirname "$0")/env.sh"
+
+apptainer pull "${SIF}" docker://nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04
+
+# record immutable fingerprint
+sha256sum "${SIF}" | tee "${REPO}/repro/container.sha256"
+apptainer inspect "${SIF}" > "${REPO}/repro/container.inspect.txt" || true
diff --git a/repro/run_demo_srun.sh b/repro/run_demo_srun.sh
@@ -0,0 +1,8 @@
+#!/usr/bin/env bash
+set -euo pipefail
+source "$(dirname "$0")/env.sh"
+
+srun --gres=gpu:1 --mem=64G -c 8 -t 02:00:00 --pty bash -lc "
+  cd '${REPO}'
+  ./repro/container_exec.sh \"micromamba run -n sam3d-objects python demo.py\"
+"
diff --git a/repro/state/container.inspect.txt b/repro/state/container.inspect.txt
@@ -0,0 +1,10 @@
+com.nvidia.cudnn.version: 8.9.0.131
+maintainer: NVIDIA CORPORATION <cudatools@nvidia.com>
+org.label-schema.build-arch: amd64
+org.label-schema.build-date: Wednesday_21_January_2026_17:58:10_CET
+org.label-schema.schema-version: 1.0
+org.label-schema.usage.apptainer.version: 1.1.4-2.el8
+org.label-schema.usage.singularity.deffile.bootstrap: docker
+org.label-schema.usage.singularity.deffile.from: nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04
+org.opencontainers.image.ref.name: ubuntu
+org.opencontainers.image.version: 22.04
diff --git a/repro/state/container.sha256 b/repro/state/container.sha256
@@ -0,0 +1 @@
+b5cb33dad6888293ef429236959add862ad9c2e798ec4c3a90d284d7019cdc23  /lustreFS/data/veupnea/cchadjiminas/containers/cuda121-ubuntu22.sif
diff --git a/repro/state/env_prefix.txt b/repro/state/env_prefix.txt
@@ -0,0 +1 @@
+ENV_PREFIX=$MAMBA_ROOT_PREFIX/envs/sam3d-objects
diff --git a/repro/state/git_commit.txt b/repro/state/git_commit.txt
@@ -0,0 +1 @@
+e19b1699d492e132892ba6c4c6594e94fbdac8f3
diff --git a/repro/state/git_dirty.txt b/repro/state/git_dirty.txt
@@ -0,0 +1,5 @@
+?? SAM3D_SETUP_NOTES.md
+?? download_model.py
+?? repro/
+?? sam3d_env.sh
+?? splat.ply
diff --git a/repro/state/glibc.txt b/repro/state/glibc.txt
@@ -0,0 +1 @@
+ldd (Ubuntu GLIBC 2.35-0ubuntu3.4) 2.35
diff --git a/repro/state/micromamba_explicit.txt b/repro/state/micromamba_explicit.txt
diff --git a/repro/state/micromamba_list.txt b/repro/state/micromamba_list.txt
diff --git a/repro/state/nvcc.txt b/repro/state/nvcc.txt
diff --git a/repro/state/nvidia-smi.txt b/repro/state/nvidia-smi.txt
diff --git a/repro/state/pip_freeze.txt b/repro/state/pip_freeze.txt
diff --git a/sam3d_env.sh b/sam3d_env.sh

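
Note on the quoting in `repro/container_exec.sh`: the in-container script is passed to `bash -lc` as a single double-quoted string, so `${VAR}` is expanded on the host before `apptainer exec` runs, while `\${VAR}` and `\$(...)` are deferred to the shell inside the container. Below is a minimal sketch of that behaviour, using plain `bash -c` as a stand-in for the container shell. `HOST_VALUE` and `INNER_VALUE` are hypothetical names for illustration, not variables from the repo.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical value defined in the outer (host) shell.
HOST_VALUE="set-on-host"

# ${HOST_VALUE} is expanded by the outer shell while it builds the string.
# \${INNER_VALUE} is passed through as a literal ${INNER_VALUE} and is only
# expanded by the inner bash, where the variable is actually defined.
# \" yields a literal double quote inside the inner script.
result="$(bash -c "
  INNER_VALUE='set-inside'
  echo \"host=${HOST_VALUE} inner=\${INNER_VALUE}\"
")"

echo "${result}"
```

Running the sketch prints `host=set-on-host inner=set-inside`. The same rule explains why a bare `"` inside the wrapped script ends the outer string early: any quote meant for the inner shell has to be written as `\"`.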