Add CUDA image variants and smoke tests #33

Open: aapdo wants to merge 5 commits into main from decs260515

@aapdo (Contributor) commented May 15, 2026

Summary

  • Add build variants for CUDA 11.8, 12.2, 12.5, and 12.8 with matching TensorFlow versions.
  • Add Miniforge/micromamba, JupyterLab, system packages, and noVNC/VNC runtime support to the image flow.
  • Add image variant build helpers and Ansible-based smoke tests for GPU visibility, TensorFlow GPU detection, Jupyter, noVNC, and uid/create_container.sh compatibility.

Validation

  • Ran local syntax and config checks for Docker/entrypoint/test helper changes.
  • Ran remote smoke tests on LAB8, LAB9, LAB10, FARM1, and FARM9.
  • Built and pushed Docker Hub images under dguailab/decs for CUDA 11.8, 12.2, 12.5, and 12.8 variants.

Copilot AI review requested due to automatic review settings May 15, 2026 05:49

gitguardian Bot commented May 15, 2026

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secret in your pull request

| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
| --- | --- | --- | --- | --- |
| 32875495 | Triggered | Generic Password | a45066a | scripts/test_uid_create_container.py |

🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn here the best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

Copilot AI left a comment

Pull request overview

This PR re-architects the DECS Docker image build to support multiple CUDA/TensorFlow variants, replacing the single-image flow built on tensorflow/tensorflow:2.18.0-gpu with a parameterized Dockerfile driven by image-variants.json. It adds Miniforge/JupyterLab/noVNC runtime support, refactors the entrypoint with driver/CUDA compatibility checks and an opt-in VNC stack, updates the GitHub Actions workflow to fan out builds via a matrix, and introduces Python helpers plus Ansible playbooks for building and smoke-testing the variants on remote GPU hosts.

Changes:

  • Parameterized Dockerfile + image-variants.json + matrix-based GitHub Actions workflow for CUDA 11.8/12.2/12.5/12.8 builds.
  • New entrypoint.sh with image runtime info, optional STRICT_CUDA_COMPAT driver check, and TigerVNC/noVNC startup.
  • New scripts/ (build/test/uid helpers) and tests/ansible/ playbooks (build + smoke), with documentation rewritten in README.md.
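
The optional STRICT_CUDA_COMPAT driver check listed above could work roughly as follows. This is a sketch under stated assumptions, not the PR's entrypoint.sh: the helper names `version_ge` and `check_driver`, their argument order, and the example version numbers are hypothetical. In the real image the minimum would presumably come from image-variants.json and the driver version from the host.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and version numbers are illustrative assumptions.

# Succeeds when version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_driver DRIVER MINIMUM [STRICT]
# Prints a status line. Fails only when STRICT is "true" and the driver is too old.
check_driver() {
    local driver="$1" minimum="$2" strict="${3:-false}"
    if version_ge "$driver" "$minimum"; then
        echo "ok: driver $driver >= $minimum"
        return 0
    fi
    if [ "$strict" = "true" ]; then
        echo "error: driver $driver < required $minimum" >&2
        return 1
    fi
    echo "warning: driver $driver < recommended $minimum" >&2
    return 0
}

# Inside a GPU container the driver version could be read with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
check_driver "550.54.15" "525.60.13" "true"
```

Keeping the non-strict path as a warning means an older driver does not block container startup unless the operator opts in to the strict gate.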

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 6 comments.


| File | Description |
| --- | --- |
| .dockerignore | Excludes git, scripts, tests, and README from the build context. |
| .github/workflows/docker-publish.yml | Adds a prepare job that emits a build matrix and updates build-and-push to consume per-variant build args. |
| .gitignore | Ignores Python bytecode artifacts. |
| Dockerfile | Switches to nvidia/cuda base, parameterizes CUDA/TF/Python via build args, installs Miniforge/JupyterLab/VNC stack. |
| README.md | Documents the variant matrix, build/test commands, VNC env vars, and admin notes. |
| entrypoint.sh | Adds image runtime banner, driver compatibility gate, VNC/noVNC startup, and uses $JUPYTER_BIN/$CONDA_DIR. |
| image-variants.json | Declares the four supported CUDA/TF variants with aliases and minimum driver versions. |
| scripts/build_variants.py | CLI to build (and optionally push) variants from the manifest. |
| scripts/test_image_variants.py | Local CPU/GPU smoke runner per variant. |
| scripts/test_uid_create_container.py | Drives ~/uid/script_test/create_container.sh for each variant. |
| scripts/variant_matrix.py | Loads the manifest, validates uniqueness, and emits a GitHub Actions matrix. |
| tests/ansible/decs_image_build.yml | Uploads build context tarball and runs docker build on remote hosts. |
| tests/ansible/decs_image_smoke.yml | Runs the image with create_container.sh-style env, validates GPU/TF/Jupyter/noVNC. |

Comments suppressed due to low confidence (1)

entrypoint.sh:12

  • is_truthy accepts true/TRUE/1/yes/YES/on/ON but not mixed-case spellings such as the Python-style True. The smoke playbook normalizes values via ternary('true', 'false'), but an operator could plausibly run docker run -e ENABLE_VNC=True by hand. Consider accepting mixed-case True/Yes/On as well (e.g., normalize with ${1,,} before matching); otherwise ENABLE_VNC=True is silently treated as false.
is_truthy() {
    case "${1:-}" in
        true|TRUE|1|yes|YES|on|ON) return 0 ;;
        *) return 1 ;;
    esac
}
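
The normalization suggested above might look like the following. This is a minimal sketch, assuming the entrypoint runs under bash 4 or later (the `${var,,}` lowercase expansion is not available in older bash or in POSIX sh); the local variable name `value` is illustrative.

```shell
#!/usr/bin/env bash
# Case-insensitive variant of is_truthy.
# Requires bash 4+ for the ${var,,} lowercase expansion.
is_truthy() {
    # Copy the argument first so an unset $1 is safe under `set -u`.
    local value="${1:-}"
    case "${value,,}" in
        true|1|yes|on) return 0 ;;
        *) return 1 ;;
    esac
}
```

With this version, ENABLE_VNC=True, ENABLE_VNC=Yes, and ENABLE_VNC=oN all enable the VNC stack, while an empty or unrecognized value still counts as false.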

if: github.event_name == 'workflow_dispatch' || github.event.pull_request.merged == true
runs-on: ubuntu-latest
outputs:
  tag_name: ${{ steps.generate_tag.outputs.TAG_NAME }}
Comment thread entrypoint.sh
Comment on lines +5 to +6
USER_PW="${USER_PW:-ailab2260}"
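
A default like the one above bakes a fallback password into the image. One possible alternative, sketched here under assumptions, is to require the password from the environment or from a mounted file and to fail otherwise. The helper name `resolve_user_pw` and the `USER_PW_FILE` variable are hypothetical and do not appear in the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: resolve the login password without a hardcoded default.
# USER_PW_FILE is an assumed convention for a mounted secret file.
resolve_user_pw() {
    if [ -n "${USER_PW:-}" ]; then
        printf '%s' "$USER_PW"
    elif [ -n "${USER_PW_FILE:-}" ] && [ -r "$USER_PW_FILE" ]; then
        # Use the first line of the file, without the trailing newline.
        head -n 1 "$USER_PW_FILE" | tr -d '\r\n'
    else
        echo "USER_PW or USER_PW_FILE must be set" >&2
        return 1
    fi
}
```

An entrypoint could then call `USER_PW="$(resolve_user_pw)" || exit 1` so that a missing password stops startup instead of silently falling back to a known value.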

Comment thread Dockerfile
Comment on lines +101 to +103
RUN apt-get update \
    && apt-get install -y --no-install-recommends tigervnc-tools \
    && rm -rf /var/lib/apt/lists/*
--runtime=nvidia
--cap-add=SYS_ADMIN
--ipc=host
--mount type=bind,source={{ test_home_root | quote }},target=/home/
Comment on lines +54 to +57
ansible.builtin.shell: >
timeout 90 bash -lc
'until docker exec {{ test_container_name | quote }} test -f /home/{{ test_username }}/decs_jupyter_lab/jupyter_token.txt;
do sleep 3; done'
Comment thread scripts/build_variants.py
Comment on lines +65 to +66
tags = build_tags(repository, variant, date_tag)
cmd = build_command(variant, repository, date_tag, args.no_cache)