Improve PTX compilation, add full(er) support for device code on CPU#36
Merged
added 10 commits
June 27, 2026 03:45
- Add --device-backend cuda: host emits no launch thunk/registry and no kernel-symbol reference, eliminating the dead second device compile (dev.ll).
- Reference the embedded PTX as an external __pas_device_ptx symbol on the cuda backend; package PTX text as its own NUL-terminated blob object at link time.
- Unify the PTX CLI into 'pascal1981 --target ptx' (--sm/--emit-llvm); keep compile_to_ptx as a deprecated alias.
- Prebuild both runtime archives (libpascalrt_cpu.a / _cuda.a) once; drop the clean-rebuild-on-switch dance.
- Update device-example.mk, build-cuda-host.sh, READMEs, and the plan doc; add cuda-backend decoupling regression tests.
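As an illustration of the "NUL-terminated blob" point above, here is a minimal sketch of the byte payload such a blob would carry. The helper name `ptx_blob_bytes` is hypothetical, and how the project actually turns the payload into an object file exporting `__pas_device_ptx` is not shown in this PR text. CUDA driver entry points such as `cuModuleLoadData` accept PTX as NUL-terminated text, which is one reason a trailing NUL matters; whether the shim uses that call is an assumption.

```python
def ptx_blob_bytes(ptx_text: str) -> bytes:
    """Return the PTX text as bytes with exactly one terminating NUL.

    Hypothetical helper: it only models the payload, not the object-file
    packaging step that exports it under a symbol name.
    """
    data = ptx_text.encode("utf-8")
    # An embedded NUL would make a C-string consumer stop reading early.
    if b"\0" in data:
        raise ValueError("PTX text must not contain an embedded NUL byte")
    return data + b"\0"


if __name__ == "__main__":
    blob = ptx_blob_bytes(".version 8.0\n.target sm_80\n")
    print(len(blob), blob.endswith(b"\0"))
```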
THREADIDX_*/BLOCKIDX_*/BLOCKDIM_*/GRIDDIM_* on the CPU triple now lower to loads from _Thread_local globals (__pas_tid_x, __pas_ctaid_x, etc.) instead of baked-in constants. pas_dev_launch loops over the full gx*gy*gz x bx*by*bz geometry, setting those registers before each thunk call -- the same semantic a GPU provides via hardware special registers. BLOCKDIM_*/GRIDDIM_* default to 1 so direct (non-LAUNCH) calls retain the old single-thread behaviour.

- codegen/exprs.py: CPU-triple builtins emit TLS loads, not constants
- runtime/cpu_device_shim.c: define 12 TLS vars; loop in pas_dev_launch
- examples/device_ptx/device-example.mk: wire DEVICE=cpu build+link
- tests: update index-intrinsic test; add shim to mandelbrot_x86 link
- CPU_DEVICE_TODO.md: marked done

Verified: fill_indices OK all 256, mandelbrot full image -- no kernel changes, PTX output unchanged.
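The launch-geometry emulation described above can be modelled in a few lines of Python. This is a sketch of the semantics only, not the C code in runtime/cpu_device_shim.c: the names `regs`, `launch`, and this `fill_indices` kernel are hypothetical, and both the iteration order and the restoring of the registers after a launch are assumptions made here for illustration.

```python
import threading
from itertools import product


class _IndexRegisters(threading.local):
    """Per-thread stand-ins for the TLS index globals (model only)."""

    def __init__(self):
        self.tid = (0, 0, 0)     # THREADIDX_X/Y/Z
        self.ctaid = (0, 0, 0)   # BLOCKIDX_X/Y/Z
        self.ntid = (1, 1, 1)    # BLOCKDIM_X/Y/Z, defaults to 1
        self.nctaid = (1, 1, 1)  # GRIDDIM_X/Y/Z, defaults to 1


regs = _IndexRegisters()


def launch(kernel, grid, block, *args):
    """Call kernel once per (block, thread) pair, setting the registers first."""
    gx, gy, gz = grid
    bx, by, bz = block
    saved = (regs.tid, regs.ctaid, regs.ntid, regs.nctaid)
    regs.ntid, regs.nctaid = tuple(block), tuple(grid)
    try:
        for cz, cy, cx in product(range(gz), range(gy), range(gx)):
            for tz, ty, tx in product(range(bz), range(by), range(bx)):
                regs.ctaid = (cx, cy, cz)
                regs.tid = (tx, ty, tz)
                kernel(*args)
    finally:
        # Assumption: restore so later direct calls see the defaults again.
        regs.tid, regs.ctaid, regs.ntid, regs.nctaid = saved


def fill_indices(out):
    """Hypothetical kernel: each logical thread writes its global x index."""
    i = regs.ctaid[0] * regs.ntid[0] + regs.tid[0]
    if i < len(out):
        out[i] = i


if __name__ == "__main__":
    buf = [-1] * 256
    launch(fill_indices, (4, 1, 1), (64, 1, 1), buf)
    print(buf == list(range(256)))
```

Calling `fill_indices` directly, without `launch`, sees the default registers and so behaves as a single thread at index 0, which mirrors the "BLOCKDIM_*/GRIDDIM_* default to 1" note.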
The two device_ptx example READMEs and the CUDA prescription doc still described DEVICE=cpu as 'not yet wired' / a single-thread grid. The CPU shim now emulates the full launch geometry, so update both READMEs with CPU + CUDA build/run instructions and prerequisites, fix the device-example.mk header comment, and update the now-stale 'single-thread grid' language in cuda-kernel-prescription.md and the codegen docstrings to match the TLS-index-register emulation.
…quires_gpu

The GPU orchestration test's _build_cuda_runtime ran 'make -C runtime clean' against the shared source-tree runtime/build/ (which every other link test links as libpascalrt.a), then rebuilt only the cuda archive. If the cuda build failed -- e.g. a driver-only box (nvidia-smi + libcuda.so.1 but no CUDA toolkit headers) -- setUpClass raised, tearDownClass never ran, and runtime/build/ was left empty, cascading link failures into every other exe-requiring test (the trailing gcc-install-dir-libstdcxx warning in each truncated summary hid the real 'no such file: libpascalrt.a' cause).

Two fixes:

1. Build the CUDA shim into an ISOLATED temp copy of the runtime sources so the shared runtime/build/ is never touched. A build failure raises unittest.SkipTest (clean skip) and leaks at most a /tmp dir, never a broken source tree. tearDownClass just removes the temp dir.
2. Tighten _probe_gpu to also require cuda.h (probed the way the Makefile looks for it, -I$CUDA_HOME/include), so @requires_gpu is False on a driver-only box and the test skips at collection rather than being selected and then failing the shim build.

Verified: header probe returns False for an empty CUDA_HOME, True when cuda.h is planted; full suite 848 passed, 1 skipped.
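The two fixes above follow a pattern that can be sketched roughly as below. The function names `has_cuda_header` and `build_in_isolated_copy` are hypothetical, not the test suite's actual helpers, and the build command is passed in by the caller rather than assumed. Unlike the behaviour described above, this sketch also removes the temp dir when the build fails.

```python
import os
import shutil
import subprocess
import tempfile
import unittest


def has_cuda_header(cuda_home):
    """True only if cuda.h exists under <cuda_home>/include."""
    return bool(cuda_home) and os.path.isfile(
        os.path.join(cuda_home, "include", "cuda.h")
    )


def build_in_isolated_copy(src_dir, build_cmd):
    """Copy src_dir into a fresh temp dir and run build_cmd there.

    The original tree is never modified. A failed build raises
    unittest.SkipTest so the caller's test class is skipped cleanly.
    Returns the temp dir; the caller removes it (e.g. in tearDownClass).
    """
    work = tempfile.mkdtemp(prefix="pasrt-cuda-")
    dst = os.path.join(work, "runtime")
    shutil.copytree(src_dir, dst)
    result = subprocess.run(build_cmd, cwd=dst, capture_output=True, text=True)
    if result.returncode != 0:
        # Sketch-only choice: tidy up before skipping.
        shutil.rmtree(work, ignore_errors=True)
        tail = result.stderr.strip()[-200:]
        raise unittest.SkipTest(f"CUDA shim build failed: {tail}")
    return work
```

A `setUpClass` could call `build_in_isolated_copy` and store the returned path, leaving `tearDownClass` with nothing to do except `shutil.rmtree` on it.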
The CPU TLS work (commit 7713a86) regressed this test: it compiled the device unit to an x86 dev.ll (for the legacy launch-thunk host path), and that dev.ll now references __pas_tid_x etc. -- TLS globals defined only in cpu_device_shim.c, not the CUDA shim -- so the GPU link failed with 'undefined reference to __pas_tid_x'.

Migrate the test off the legacy --embed-device-ptx path onto the decoupled cuda backend: compile the host with device_backend='cuda' (emits no launch thunk and no kernel-symbol reference, so no dev.ll is linked), objectify the PTX into a NUL-terminated __pas_device_ptx blob, and link host.ll + blob.o + cuda shim + -lcuda. This is the same 3-command flow the cleanup work established for the examples.

Verified (no GPU needed): host .ll has external __pas_device_ptx, no __pas_klaunch, no kernel def; ld -r host.o + blob.o links clean -- no undefined TLS symbol. The actual CUDA shim build + run remain @requires_gpu. Full suite 848 passed, 1 skipped.
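The GPU-free verification above amounts to three textual checks on the host module. A rough sketch of such a check follows; `host_ir_problems` is a hypothetical name, the regexes assume the usual LLVM IR textual forms (`@sym = external ...` for an external global, `define ... @name(` for a function definition), and the sample IR is illustrative rather than real compiler output.

```python
import re


def host_ir_problems(ir_text, kernel_name):
    """Return a list of ways the host IR violates the decoupled-cuda shape."""
    problems = []
    # The PTX blob must be declared as an external global, not defined here.
    if not re.search(r"^@__pas_device_ptx\s*=\s*external\b", ir_text, re.MULTILINE):
        problems.append("missing external __pas_device_ptx")
    # No launch thunk should be emitted or referenced on the cuda backend.
    if "__pas_klaunch" in ir_text:
        problems.append("launch thunk reference present")
    # The kernel body belongs in the PTX, never in the host module.
    kernel_def = r"^define\b[^\n]*@" + re.escape(kernel_name) + r"\("
    if re.search(kernel_def, ir_text, re.MULTILINE):
        problems.append("kernel defined in host module")
    return problems


if __name__ == "__main__":
    # Illustrative host IR, written by hand for this sketch.
    sample = (
        "@__pas_device_ptx = external constant [0 x i8]\n"
        "\n"
        "define i32 @main() {\n"
        "  ret i32 0\n"
        "}\n"
    )
    print(host_ir_problems(sample, "fill_indices"))
```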
The testing section omitted that link-requiring tests link the hardcoded runtime/build/libpascalrt.a and FAIL (not skip) without it. State the prerequisite up front so a clean-tree run isn't a surprise.
No description provided.