-
Notifications
You must be signed in to change notification settings - Fork 3
[Refactor] Refactor NVSHMEM dependency #45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
📝 WalkthroughWalkthroughThis PR removes multiple distributed benchmark scripts, utilities, and build infrastructure while refactoring NVSHMEM discovery to support both explicit source paths and pre-installed wheel-based configurations. The changes eliminate AG-GEMM, all-gather, all-to-all, GEMM-RS, and reduce-scatter benchmark implementations, along with IPC benchmarks and related documentation. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
…sure compatibility with newer releases.
…ure symlink creation for versioned libraries. Update documentation for NVSHMEM setup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tilelang/engine/lower.py (1)
91-96: Critical: Missing comma causes incorrect linker flags concatenation.Line 94-95 is missing a comma between the linker library flags and
-rdc=true. This will concatenate the strings into"-lnvshmem_host -lnvshmem_device-rdc=true"instead of being separate options, causing compilation failures.🐛 Proposed fix
options += [ "-I" + nvshmem_include_path, "-L" + nvshmem_lib_path, - "-lnvshmem_host -lnvshmem_device" - "-rdc=true", + "-lnvshmem_host", + "-lnvshmem_device", + "-rdc=true", ]docs/get_started/Installation.md (1)
71-75: Incorrect path andLD_LIBRARY_PATHinstructions.
- The path
./pynvshmemappears incorrect—based on the repository structure, it should be./tilelang/distributed/pynvshmem.LD_LIBRARY_PATHuses$NVSHMEM_SRCwhich won't be set when using the wheel fallback. Consider documenting both scenarios.📝 Proposed fix
You also need to install the `pynvshmem` package, which provides wrapped host-side Python API for NVSHMEM. ```bash -cd ./pynvshmem +cd ./tilelang/distributed/pynvshmem python setup.py install -export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/src/lib:$LD_LIBRARY_PATH"+If using NVSHMEM built from source:
+bash +export LD_LIBRARY_PATH="$NVSHMEM_SRC/build/src/lib:$LD_LIBRARY_PATH" +
+
+If using the default NVSHMEM wheel, the library path is automatically resolved.</details> </blockquote></details> </blockquote></details>🤖 Fix all issues with AI agents
In `@tilelang/env.py`: - Line 268: The assertion uses an unnecessary f-string: update the assert statement that checks nvshmem_wheel_path.exists() to use a normal string for the message (remove the leading 'f' before the quoted message) so it reads as a plain assertion message for nvshmem_wheel_path.exists().🧹 Nitpick comments (7)
tilelang/engine/lower.py (1)
89-98: RedundantNVSHMEM_SRCcheck.The check at line 90 is redundant—if
env.NVSHMEM_SRCwere falsy, line 74 would have already raised the error. Consider removing the second conditional or consolidating the logic.♻️ Proposed simplification
if env.USE_DISTRIBUTED: - if env.NVSHMEM_SRC: - options += [ - "-I" + nvshmem_include_path, - "-L" + nvshmem_lib_path, - "-lnvshmem_host", - "-lnvshmem_device", - "-rdc=true", - ] - else: - raise ValueError("NVSHMEM_SRC is not set") + options += [ + "-I" + nvshmem_include_path, + "-L" + nvshmem_lib_path, + "-lnvshmem_host", + "-lnvshmem_device", + "-rdc=true", + ]tilelang/env.py (1)
262-271: Add error handling for missingnvidia-nvshmem-cu12wheel.If
USE_DISTRIBUTEDis enabled but neitherNVSHMEM_SRCis set nor thenvidia-nvshmem-cu12wheel is installed,distribution()will raisePackageNotFoundErrorwithout a helpful message.♻️ Proposed fix with clearer error handling
else: # installed from wheel - from importlib.metadata import distribution + from importlib.metadata import distribution, PackageNotFoundError - root = pathlib.Path(distribution("nvidia-nvshmem-cu12").locate_file("")) + try: + root = pathlib.Path(distribution("nvidia-nvshmem-cu12").locate_file("")) + except PackageNotFoundError: + raise RuntimeError( + "NVSHMEM_SRC is not set and nvidia-nvshmem-cu12 wheel is not installed. " + "Please either set NVSHMEM_SRC or install: pip install nvidia-nvshmem-cu12" + ) from None nvshmem_wheel_path = root / "nvidia" / "nvshmem"tilelang/distributed/pynvshmem/setup.py (3)
101-106: Consider catching a more specific exception.Catching bare
Exceptioncan mask unexpected errors. For symlink creation, typical failures areOSError(permission denied, filesystem issues) orFileExistsError.♻️ Proposed fix
try: # Create the symlink symlink_path.symlink_to(target) print(f"Created symlink: {symlink_path} -> {target}") - except Exception as e: + except OSError as e: print(f"Warning: Could not create symlink {symlink_path}: {e}")
93-99: Non-deterministic version selection when multiple versioned libraries exist.The comment mentions "latest if multiple" but
glob()doesn't guarantee ordering. If multiple versions exist (e.g.,libnvshmem_host.so.2,libnvshmem_host.so.3), selection is arbitrary.♻️ Proposed fix to select the latest version
# Find the versioned library file versioned_libs = list(lib_path.glob(pattern)) if not versioned_libs: continue - # Use the first match (or latest if multiple) - target = versioned_libs[0].name + # Use the latest version if multiple exist + target = sorted(versioned_libs)[-1].name
110-117: PotentialNonevalues passed to build system.If
USE_DISTRIBUTEDis not enabled in the environment,env.NVSHMEM_INCLUDE_DIRandenv.NVSHMEM_LIB_PATHwill beNone, causing the build to fail with an unclear error. Consider adding validation.♻️ Proposed fix
`@pathlib_wrapper` def nvshmem_deps(): # Ensure symlinks exist before returning dependencies ensure_nvshmem_symlinks() + if env.NVSHMEM_INCLUDE_DIR is None or env.NVSHMEM_LIB_PATH is None: + raise RuntimeError( + "NVSHMEM paths not configured. Ensure TILELANG_USE_DISTRIBUTED=1 is set " + "and either NVSHMEM_SRC is set or nvidia-nvshmem-cu12 is installed." + ) + include_dirs = [env.NVSHMEM_INCLUDE_DIR] library_dirs = [env.NVSHMEM_LIB_PATH] libraries = ["nvshmem_host", "nvshmem_device"] return include_dirs, library_dirs, librariestilelang/distributed/pynvshmem/CMakeLists.txt (1)
47-57: Add error handling for NVSHMEM path discovery.If
TILELANG_USE_DISTRIBUTEDis not enabled, the Python command will outputNoneas strings, causingfind_libraryto fail with an unclear error. Consider adding validation.♻️ Proposed fix
# find NVSHMEM from tilelang.env execute_process( COMMAND ${PYTHON_EXECUTABLE} "-c" - "from tilelang.env import env; print(env.NVSHMEM_INCLUDE_DIR); print(env.NVSHMEM_LIB_PATH)" - OUTPUT_VARIABLE NVSHMEM_PATHS) + "from tilelang.env import env; print(env.NVSHMEM_INCLUDE_DIR or ''); print(env.NVSHMEM_LIB_PATH or '')" + OUTPUT_VARIABLE NVSHMEM_PATHS + RESULT_VARIABLE _NVSHMEM_SUCCESS) +if(NOT _NVSHMEM_SUCCESS MATCHES 0) + message(FATAL_ERROR "Failed to get NVSHMEM paths from tilelang.env") +endif() string(REGEX REPLACE "\n" ";" NVSHMEM_PATHS "${NVSHMEM_PATHS}") list(GET NVSHMEM_PATHS 0 NVSHMEM_INCLUDE_DIR) list(GET NVSHMEM_PATHS 1 NVSHMEM_LIB_DIR) +if(NOT NVSHMEM_INCLUDE_DIR OR NOT NVSHMEM_LIB_DIR) + message(FATAL_ERROR "NVSHMEM paths not configured. Set TILELANG_USE_DISTRIBUTED=1 and ensure NVSHMEM is available.") +endif() set(NVSHMEM_INCLUDE_DIRS ${NVSHMEM_INCLUDE_DIR})docs/get_started/Installation.md (1)
61-68: Consider restructuring for clarity.The current flow could be clearer. Consider organizing as two distinct options:
📝 Suggested restructure
-Before running the examples using NVSHMEM APIs (e.g., [example_allgather.py](../../examples/distributed/example_allgather.py)), you need to install NVSHMEM dependencies. -You can set environment variable `NVSHMEM_SRC` to force using NVSHMEM built from source. +Before running examples using NVSHMEM APIs (e.g., [example_allgather.py](../../examples/distributed/example_allgather.py)), you need to install NVSHMEM dependencies. + +**Option 1: Use pre-installed NVSHMEM wheel (recommended)** + +No additional setup required—the default `nvidia-nvshmem-cu12` wheel will be used automatically. -```bash +**Option 2: Use NVSHMEM built from source** + +Set the environment variable to point to your custom NVSHMEM directory: +```bash export NVSHMEM_SRC="your_custom_nvshmem_dir"
-You can also skip this and use the default pre-installed NVSHMEM wheel.
</details> </blockquote></details> </blockquote></details> <details> <summary>📜 Review details</summary> **Configuration used**: defaults **Review profile**: CHILL **Plan**: Pro <details> <summary>📥 Commits</summary> Reviewing files that changed from the base of the PR and between eccbd61c37671beccb5208e65088a49987496dc2 and 9f96ff2ad60cd7b9e60875278fdae3a049e8b483. </details> <details> <summary>📒 Files selected for processing (30)</summary> * `benchmark/distributed/README.md` * `benchmark/distributed/benchmark_ag_gemm.py` * `benchmark/distributed/benchmark_all_gather.py` * `benchmark/distributed/benchmark_all_to_all.py` * `benchmark/distributed/benchmark_gemm_rs.py` * `benchmark/distributed/benchmark_reduce_scatter.py` * `benchmark/distributed/ipc_impls/README.md` * `benchmark/distributed/ipc_impls/benchmark_nvshmem_p2p.py` * `benchmark/distributed/ipc_impls/benchmark_unrolledcp_p2p.py` * `benchmark/distributed/utils.py` * `docs/get_started/Installation.md` * `examples/distributed/README.md` * `examples/distributed/nvshmem_legacy/example_all_to_all.py` * `examples/distributed/nvshmem_legacy/example_allgather.py` * `examples/distributed/nvshmem_legacy/example_allgather_gemm.py` * `examples/distributed/nvshmem_legacy/example_cannon.py` * `examples/distributed/nvshmem_legacy/example_nvshmem.py` * `examples/distributed/nvshmem_legacy/example_overlapping_allgather.py` * `examples/distributed/nvshmem_legacy/example_post_attn_all2all_transpose.py` * `examples/distributed/nvshmem_legacy/example_pre_attn_all2all.py` * `examples/distributed/nvshmem_legacy/example_pre_attn_all2all_transpose.py` * `examples/distributed/nvshmem_legacy/example_simple_shift.py` * `examples/distributed/nvshmem_legacy/example_summa.py` * `examples/distributed/nvshmem_legacy/gemm_rs_utils.py` * `testing/python/distributed/test_tilelang_language_ldst_options.py` * `tilelang/distributed/build_nvshmem.sh` * `tilelang/distributed/pynvshmem/CMakeLists.txt` * `tilelang/distributed/pynvshmem/setup.py` * `tilelang/engine/lower.py` * `tilelang/env.py` </details> <details> <summary>💤 Files with no reviewable changes (12)</summary> * tilelang/distributed/build_nvshmem.sh * benchmark/distributed/README.md * benchmark/distributed/ipc_impls/benchmark_unrolledcp_p2p.py * benchmark/distributed/benchmark_ag_gemm.py * benchmark/distributed/ipc_impls/benchmark_nvshmem_p2p.py * benchmark/distributed/utils.py * examples/distributed/README.md * benchmark/distributed/benchmark_reduce_scatter.py * benchmark/distributed/benchmark_all_gather.py * benchmark/distributed/benchmark_all_to_all.py * benchmark/distributed/benchmark_gemm_rs.py * benchmark/distributed/ipc_impls/README.md </details> <details> <summary>🧰 Additional context used</summary> <details> <summary>🪛 LanguageTool</summary> <details> <summary>docs/get_started/Installation.md</summary> [style] ~68-~68: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym. Context: ...he default pre-installed NVSHMEM wheel. You also need to install the `pynvshmem` pa... (ENGLISH_WORD_REPEAT_BEGINNING_RULE) </details> </details> <details> <summary>🪛 Ruff (0.14.11)</summary> <details> <summary>tilelang/distributed/pynvshmem/setup.py</summary> 105-105: Do not catch blind exception: `Exception` (BLE001) </details> <details> <summary>tilelang/env.py</summary> 268-268: f-string without any placeholders Remove extraneous `f` prefix (F541) </details> </details> </details> <details> <summary>🔇 Additional comments (2)</summary><blockquote> <details> <summary>tilelang/env.py (1)</summary><blockquote> `255-274`: **LGTM on the NVSHMEM path discovery logic.** The dual-path approach (source build vs. wheel fallback) is well-structured. The paths for source builds (`build/src/include`, `build/src/lib`) align with typical NVSHMEM build outputs. </blockquote></details> <details> <summary>tilelang/distributed/pynvshmem/CMakeLists.txt (1)</summary><blockquote> `70-75`: **LGTM on the target linking changes.** Using CMake variables for library paths is cleaner and more maintainable than hardcoding. </blockquote></details> </blockquote></details> <sub>✏️ Tip: You can disable this entire section by setting `review_details` to `false` in your review settings.</sub> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
| root = pathlib.Path(distribution("nvidia-nvshmem-cu12").locate_file("")) | ||
| nvshmem_wheel_path = root / "nvidia" / "nvshmem" | ||
| logger.warning(NVSHMEM_USE_DEFAULT_WHEEL_MESSAGE % nvshmem_wheel_path) | ||
| assert nvshmem_wheel_path.exists(), f"NVSHMEM wheel path does not exist" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove extraneous f prefix from string.
This f-string has no placeholders. Remove the f prefix.
🐛 Proposed fix
- assert nvshmem_wheel_path.exists(), f"NVSHMEM wheel path does not exist"
+ assert nvshmem_wheel_path.exists(), "NVSHMEM wheel path does not exist"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| assert nvshmem_wheel_path.exists(), f"NVSHMEM wheel path does not exist" | |
| assert nvshmem_wheel_path.exists(), "NVSHMEM wheel path does not exist" |
🧰 Tools
🪛 Ruff (0.14.11)
268-268: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
In `@tilelang/env.py` at line 268, The assertion uses an unnecessary f-string:
update the assert statement that checks nvshmem_wheel_path.exists() to use a
normal string for the message (remove the leading 'f' before the quoted message)
so it reads as a plain assertion message for nvshmem_wheel_path.exists().
Tasks
Summary by CodeRabbit
Release Notes
Documentation
Chores
✏️ Tip: You can customize this high-level summary in your review settings.