@cjac cjac commented May 4, 2025

This commit significantly refactors the GPU initialization action to improve support for custom image builds, enhance robustness, and update documentation.

Key Changes:

  1. Custom Image Building (invocation-type=custom-images):

    • The script now detects the invocation-type=custom-images metadata.
    • When detected, Hadoop/Spark configurations are deferred to the first boot of a cluster instance created from the custom image. This is managed by a new systemd service, dataproc-gpu-config.service.
    • This prevents issues where configurations are applied too early in the image build process.
  2. GCS Caching and Performance:

    • The README now documents the GCS caching mechanism for downloaded artifacts (drivers, CUDA) and compiled components (kernel modules, NCCL).
    • It highlights the time saved on subsequent runs once the cache is warm.
    • It warns that the first run can take up to 150 minutes on small instances if components must be built from source, and recommends pre-warming the cache on a larger instance.
    • It notes the security benefit of cached artifacts: cluster nodes need fewer build tools installed.
  3. Hash Validation:

    • Added SHA256 hash verification for downloaded NVIDIA driver and CUDA .run files to ensure integrity.
  4. Documentation (gpu/README.md):

    • Fully revamped to reflect the script changes.
    • Updated default CUDA versions and tested configurations.
    • Clearer gcloud examples.
    • New section on custom image usage.
    • Updated metadata parameters list.
    • Improved Secure Boot and troubleshooting sections.
    • Clarified GPU agent metric reporting.
  5. Script Enhancements (gpu/install_gpu_driver.sh):

    • Refactored configuration logic into functions called conditionally.
    • Improved GPG key fetching behind a proxy.
    • Adjusted Conda paths for Dataproc 2.3+.
    • More robust kernel-devel fetching on Rocky Linux.
    • Better DATAPROC_IMAGE_VERSION detection.
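
The deferral described in item 1 can be sketched roughly as follows. This is a minimal illustration and not the code in `gpu/install_gpu_driver.sh`: the function names (`get_metadata_attribute`, `config_mode`, `render_deferred_unit`) and the unit contents are hypothetical. Only the `invocation-type=custom-images` metadata value and the `dataproc-gpu-config.service` name come from this PR; the URL is the standard Compute Engine metadata endpoint.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: decide whether to apply Hadoop/Spark configuration now
# or defer it to the first boot of an instance created from a custom image.

# Read an instance metadata attribute from the Compute Engine metadata server.
# Prints an empty string if the attribute is unset or the server is unreachable.
get_metadata_attribute() {
  local name="$1"
  curl -fs --max-time 2 -H 'Metadata-Flavor: Google' \
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/${name}" \
    2>/dev/null || true
}

# Print "defer" when running inside a custom image build, "configure" otherwise.
config_mode() {
  local invocation_type="${1:-}"
  if [[ "${invocation_type}" == "custom-images" ]]; then
    echo "defer"
  else
    echo "configure"
  fi
}

# Print a oneshot systemd unit that would run the deferred configuration.
# The unit body is illustrative; the script path is supplied by the caller.
render_deferred_unit() {
  local script_path="$1"
  cat <<EOF
[Unit]
Description=Deferred GPU Hadoop/Spark configuration (sketch)
After=network-online.target

[Service]
Type=oneshot
ExecStart=${script_path}

[Install]
WantedBy=multi-user.target
EOF
}
```

A caller might run `config_mode "$(get_metadata_attribute invocation-type)"` and, on `defer`, write the output of `render_deferred_unit` to `/etc/systemd/system/dataproc-gpu-config.service` and enable it. A real unit would also need a way to run only once, which this sketch leaves out.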

Purpose:

These changes make the GPU initialization action more flexible for use in custom image pipelines, improve the reliability of installations, and provide users with better guidance on performance and security implications.
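
For reference, the SHA256 check from item 3 might look something like the sketch below. The helper name `verify_sha256` is hypothetical and the script's actual implementation may differ; the sketch assumes only that `sha256sum` from GNU coreutils is available on the node.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: verify a downloaded file against an expected SHA256 digest.

# Returns 0 if the digest of FILE equals EXPECTED, 1 otherwise.
# A missing or unreadable file yields an empty digest and therefore a mismatch.
verify_sha256() {
  local file="$1" expected="$2" actual
  actual="$(sha256sum "${file}" 2>/dev/null | awk '{print $1}')"
  if [[ "${actual}" != "${expected}" ]]; then
    echo "SHA256 mismatch for ${file}: expected ${expected}, got ${actual:-<none>}" >&2
    return 1
  fi
}
```

An installer could call `verify_sha256 "${run_file}" "${expected_hash}" || exit 1` before executing a `.run` file, so that a corrupted or tampered download stops the run early. Both variable names in that call are placeholders.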

cjac commented May 4, 2025

/gcbrun

@cjac cjac changed the title from "[gpu] Defer GPU config on custom images to resolve #1303" to "[gpu] Enhance driver installer and update README for custom images, versions, and performance" May 4, 2025
@cjac cjac requested a review from Deependra-Patel May 6, 2025 10:22

cjac commented May 6, 2025

/gcbrun

@cjac cjac requested review from rrohanarora and singhravidutt and removed request for singhravidutt May 20, 2025 23:58

cjac commented Oct 12, 2025

Closing and starting a new one.

@cjac cjac closed this Oct 12, 2025

cjac commented Dec 6, 2025

merged into #1363