021e234 to a07df0d
- Increase dev mode GCE disk to 300GB pd-balanced for compilation
- Keep production mode at 100GB pd-ssd for optimal serving performance
- Fixes exit code 128 failures due to insufficient disk space during Rust compilation

Signed-off-by: Nick Mitchell <nickm@us.ibm.com>
… compatibility
- Add vllm_precompiled_wheel_commit field to GceConfig with default value from llm-d
- Update setup-dev.sh to use VLLM_PRECOMPILED_WHEEL_COMMIT for wheel lookup
- Propagate parameter through cloud-config.yaml and up.rs substitutions
- Fixes ImportError: undefined symbol _ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_jb
- Ensures precompiled wheels match PyTorch/CUDA environment ABI

Signed-off-by: Nick Mitchell <nickm@us.ibm.com>
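A minimal sketch of how the new config field and its template substitution might fit together. It is written in Rust because `up.rs` is Rust, but the struct is a trimmed-down, hypothetical stand-in for the real `GceConfig`. The names `DEFAULT_VLLM_PRECOMPILED_WHEEL_COMMIT` and `render_cloud_config` and the `{{VLLM_PRECOMPILED_WHEEL_COMMIT}}` placeholder syntax are assumptions. The default shown is a placeholder, not the actual llm-d commit.

```rust
/// Hypothetical, trimmed-down stand-in for the real GceConfig.
#[derive(Debug, Clone)]
pub struct GceConfig {
    /// vLLM commit whose precompiled wheel should be installed, so the wheel
    /// matches the PyTorch/CUDA ABI on the VM.
    pub vllm_precompiled_wheel_commit: String,
}

/// Placeholder only: the real default comes from llm-d and is not reproduced here.
pub const DEFAULT_VLLM_PRECOMPILED_WHEEL_COMMIT: &str = "LLM_D_DEFAULT_COMMIT";

impl Default for GceConfig {
    fn default() -> Self {
        Self {
            vllm_precompiled_wheel_commit: DEFAULT_VLLM_PRECOMPILED_WHEEL_COMMIT.to_string(),
        }
    }
}

/// Substitute the commit into a cloud-config template.
/// The `{{...}}` placeholder syntax is an assumption for this sketch.
pub fn render_cloud_config(template: &str, cfg: &GceConfig) -> String {
    template.replace(
        "{{VLLM_PRECOMPILED_WHEEL_COMMIT}}",
        &cfg.vllm_precompiled_wheel_commit,
    )
}

fn main() {
    // Start from the default, then override it as a user-supplied value would.
    let mut cfg = GceConfig::default();
    cfg.vllm_precompiled_wheel_commit = "abc123".to_string();

    let template = "export VLLM_PRECOMPILED_WHEEL_COMMIT={{VLLM_PRECOMPILED_WHEEL_COMMIT}}\n";
    print!("{}", render_cloud_config(template, &cfg));
}
```

The rendered line is what a script such as setup-dev.sh could then read to look up the matching wheel.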
a9611d6 to 862bc76
Signed-off-by: Nick Mitchell <nickm@us.ibm.com>
Changes
Problem
The previous PR run failed with exit code 128 during cloud-init setup. Analysis of the logs showed the VM was compiling Rust dependencies but ran out of disk space, causing the build to fail and the instance to shut down prematurely.
Solution
This PR increases the disk size for dev mode (CI/testing) from 100GB to 300GB and switches it to the pd-balanced disk type for cost efficiency. Production mode remains unchanged at 100GB pd-ssd for optimal serving performance.
The disk configuration now adapts based on whether `SPNL_GITHUB` is set:
- Dev mode (`SPNL_GITHUB` set): 300GB pd-balanced
- Production mode (`SPNL_GITHUB` unset): 100GB pd-ssd

This should resolve the disk space issues during Rust compilation of SPNL and its many dependencies (geodatafusion, lance-datafusion, tantivy, etc.).
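A hedged sketch of the mode-dependent disk selection described above. `DiskType`, `DiskConfig`, and `disk_config` are hypothetical names, and treating an empty `SPNL_GITHUB` as unset is an assumption. The PR's actual Rust code may structure this differently.

```rust
use std::env;

/// GCE persistent disk types used in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DiskType {
    PdBalanced,
    PdSsd,
}

impl DiskType {
    /// The disk type string as GCE names it.
    pub fn as_gce_str(self) -> &'static str {
        match self {
            DiskType::PdBalanced => "pd-balanced",
            DiskType::PdSsd => "pd-ssd",
        }
    }
}

/// Hypothetical boot disk settings for the VM.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DiskConfig {
    pub size_gb: u32,
    pub disk_type: DiskType,
}

/// Dev mode (SPNL_GITHUB set, building from source): extra room for Rust compilation.
/// Production mode (unset): smaller, faster disk for serving.
/// Assumption: an empty value is treated the same as unset.
pub fn disk_config(spnl_github: Option<&str>) -> DiskConfig {
    match spnl_github {
        Some(s) if !s.is_empty() => DiskConfig { size_gb: 300, disk_type: DiskType::PdBalanced },
        _ => DiskConfig { size_gb: 100, disk_type: DiskType::PdSsd },
    }
}

fn main() {
    // Read the environment once, then delegate to the pure function.
    let github = env::var("SPNL_GITHUB").ok();
    let cfg = disk_config(github.as_deref());
    println!("{}GB {}", cfg.size_gb, cfg.disk_type.as_gce_str());
}
```

Keeping the decision in a pure function that takes `Option<&str>`, rather than reading the environment inside it, makes both modes easy to unit-test.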