Bug fixes for GNN recipes #1667

Open
mnabian wants to merge 1 commit into main from gnn-recipes-bug-fixes

Conversation

Collaborator

@mnabian mnabian commented May 22, 2026

PhysicsNeMo Pull Request

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI’s assessment of merge readiness and is not a qualitative judgment of your work, nor is
it an indication that the PR will be accepted / rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@mnabian mnabian requested a review from ktangsali May 22, 2026 18:24

copy-pr-bot Bot commented May 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mnabian mnabian self-assigned this May 22, 2026
Contributor

greptile-apps Bot commented May 22, 2026

Greptile Summary

This PR fixes several runtime bugs in GNN-based recipes: CUDA library conflicts caused by TensorFlow's bundled GPU runtime, file-descriptor exhaustion in multiprocessing dataset loading, and a hardcoded PyG wheel URL that worked for only one torch+CUDA combination.

  • sintering_physics: Replaces tensorflow with tensorflow-cpu and removes the now-redundant GPU memory growth setup in both train.py and inference.py, eliminating cudaErrorStubLibrary crashes caused by TF's bundled CUDA 12 conflicting with the container's CUDA 13.
  • ahmed_body_dataset.py: Adds a ProcessPoolExecutor initializer that switches spawned workers to the file_system tensor-sharing strategy, preventing RuntimeError: received 0 items of ancdata when the default file-descriptor strategy exhausts RLIMIT_NOFILE under high worker counts.
  • xaeronet: Updates the README to use a runtime-detected torch+CUDA version string for the PyG wheel URL, and moves torch_scatter out of requirements.txt to avoid silent version-mismatch failures.

Important Files Changed

| Filename | Overview |
| --- | --- |
| physicsnemo/datapipes/gnn/ahmed_body_dataset.py | Adds _init_pool_worker initializer to set file_system sharing strategy in spawn workers, fixing FD exhaustion when many tensors are returned through ProcessPoolExecutor |
| examples/additive_manufacturing/sintering_physics/requirements.txt | Switches from tensorflow to tensorflow-cpu to avoid CUDA 12/13 runtime conflict in the PhysicsNeMo container |
| examples/additive_manufacturing/sintering_physics/inference.py | Removes now-unnecessary tf.config GPU memory growth setup (correct since tensorflow-cpu has no GPU devices) |
| examples/additive_manufacturing/sintering_physics/train.py | Same GPU memory growth removal as inference.py; correct cleanup for the tensorflow-cpu switch |
| examples/cfd/external_aerodynamics/xaeronet/README.md | Replaces hardcoded torch-2.8.0+cu129 pyg-lib install URL with a dynamic snippet that detects the installed torch and CUDA versions at runtime |
| examples/cfd/external_aerodynamics/xaeronet/requirements.txt | Moves torch_scatter out of requirements.txt (now installed via the version-matched PyG wheel URL in README) and adds scikit-learn, tabulate, matplotlib |
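To illustrate the version-matched wheel URL mentioned for the xaeronet README, here is a rough sketch of how such a URL could be derived. It is not the README's actual snippet: the helper name `pyg_wheel_index` is hypothetical, and the sketch assumes the PyG wheel index follows the `torch-<version>+<cuda tag>.html` pattern seen in the previously hardcoded `torch-2.8.0+cu129` string.

```python
def pyg_wheel_index(torch_version, cuda_version):
    """Build a PyG wheel index URL from version strings.

    torch_version: value of torch.__version__, e.g. "2.8.0+cu129"
    cuda_version:  value of torch.version.cuda, e.g. "12.9",
                   or None for a CPU-only build
    """
    # Drop any local build suffix such as "+cu129".
    base = torch_version.split("+")[0]
    # "12.9" becomes "cu129"; CPU-only builds use the "cpu" tag.
    tag = "cpu" if cuda_version is None else "cu" + cuda_version.replace(".", "")
    return f"https://data.pyg.org/whl/torch-{base}+{tag}.html"


print(pyg_wheel_index("2.8.0+cu129", "12.9"))
```

With PyTorch installed, calling `pyg_wheel_index(torch.__version__, torch.version.cuda)` gives the value to pass to `pip install torch_scatter -f <url>`. Pre-release or vendor-built torch versions may not have a matching wheel index, so the result should be checked before relying on it.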

Reviews (1): Last reviewed commit: "bug fixes"
