Improve PTX compilation, add full(er) support for device code on CPU by jamesmcclain · Pull Request #36 · jamesmcclain/pascal-1981

jamesmcclain · 2026-06-27T05:33:28Z

No description provided.

- Add --device-backend cuda: host emits no launch thunk/registry and no kernel-symbol reference, eliminating the dead second device compile (dev.ll).
- Reference the embedded PTX as an external __pas_device_ptx symbol on the cuda backend; package PTX text as its own NUL-terminated blob object at link time.
- Unify the PTX CLI into 'pascal1981 --target ptx' (--sm/--emit-llvm); keep compile_to_ptx as a deprecated alias.
- Prebuild both runtime archives (libpascalrt_cpu.a / _cuda.a) once; drop the clean-rebuild-on-switch dance.
- Update device-example.mk, build-cuda-host.sh, READMEs, and the plan doc; add cuda-backend decoupling regression tests.

THREADIDX_*/BLOCKIDX_*/BLOCKDIM_*/GRIDDIM_* on the CPU triple now lower to loads from _Thread_local globals (__pas_tid_x, __pas_ctaid_x, etc.) instead of baked-in constants. pas_dev_launch loops over the full gx*gy*gz x bx*by*bz geometry, setting those registers before each thunk call -- the same semantics a GPU provides via hardware special registers. BLOCKDIM_*/GRIDDIM_* default to 1 so direct (non-LAUNCH) calls retain the old single-thread behaviour.

- codegen/exprs.py: CPU-triple builtins emit TLS loads, not constants
- runtime/cpu_device_shim.c: define 12 TLS vars; loop in pas_dev_launch
- examples/device_ptx/device-example.mk: wire DEVICE=cpu build+link
- tests: update index-intrinsic test; add shim to mandelbrot_x86 link
- CPU_DEVICE_TODO.md: marked done

Verified: fill_indices OK for all 256 indices, mandelbrot renders the full image -- no kernel changes, PTX output unchanged.
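The launch-geometry emulation described above can be modelled in a few lines of Python. This is only a sketch of the semantics, not the C shim: the `regs` namespace stands in for the _Thread_local index globals, the field names, the `launch` helper, and the serial iteration order are all assumptions made for illustration, and `fill_indices` here is a toy kernel borrowing the name from the verification note.

```python
import itertools
from types import SimpleNamespace

# Stand-in for the thread-local index globals (__pas_tid_x, __pas_ctaid_x, ...).
# Field names are hypothetical. Dimensions default to 1 so that a direct,
# non-LAUNCH call behaves like a single-thread grid.
regs = SimpleNamespace(
    tid=(0, 0, 0), ctaid=(0, 0, 0), blockdim=(1, 1, 1), griddim=(1, 1, 1)
)


def launch(kernel, grid, block, *args):
    """Call kernel once per (block, thread) coordinate, one after another."""
    gx, gy, gz = grid
    bx, by, bz = block
    saved = (regs.tid, regs.ctaid, regs.blockdim, regs.griddim)
    regs.blockdim, regs.griddim = block, grid
    try:
        for cz, cy, cx in itertools.product(range(gz), range(gy), range(gx)):
            for tz, ty, tx in itertools.product(range(bz), range(by), range(bx)):
                # Set the "registers" before each call, as the shim is
                # described as doing before each thunk call.
                regs.ctaid = (cx, cy, cz)
                regs.tid = (tx, ty, tz)
                kernel(*args)
    finally:
        # Restore the defaults so later direct calls see a 1x1x1 geometry.
        regs.tid, regs.ctaid, regs.blockdim, regs.griddim = saved


def fill_indices(out):
    # Global x index, computed the way a kernel would from
    # BLOCKIDX_X * BLOCKDIM_X + THREADIDX_X.
    i = regs.ctaid[0] * regs.blockdim[0] + regs.tid[0]
    out[i] = i


buf = [-1] * 256
launch(fill_indices, (4, 1, 1), (64, 1, 1), buf)
```

Calling `fill_indices` directly, outside `launch`, writes only index 0, which mirrors the "defaults to 1" behaviour for non-LAUNCH calls.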

- device-build-cleanup-plan.md: fully implemented (commit 47ba728)
- CPU_DEVICE_TODO.md: CPU device now emulates full GPU launch geometry (commit 7713a86)

The two device_ptx example READMEs and the CUDA prescription doc still described DEVICE=cpu as 'not yet wired' / a single-thread grid. The CPU shim now emulates the full launch geometry, so update both READMEs with CPU + CUDA build/run instructions and prerequisites, fix the device-example.mk header comment, and update the now-stale 'single-thread grid' language in cuda-kernel-prescription.md and the codegen docstrings to match the TLS-index-register emulation.

…quires_gpu

The GPU orchestration test's _build_cuda_runtime ran 'make -C runtime clean' against the shared source-tree runtime/build/ (which every other link test links as libpascalrt.a), then rebuilt only the cuda archive. If the cuda build failed -- e.g. a driver-only box (nvidia-smi + libcuda.so.1 but no CUDA toolkit headers) -- setUpClass raised, tearDownClass never ran, and runtime/build/ was left empty, cascading link failures into every other exe-requiring test (the trailing gcc-install-dir-libstdcxx warning in each truncated summary hid the real 'no such file: libpascalrt.a' cause).

Two fixes:

1. Build the CUDA shim into an ISOLATED temp copy of the runtime sources so the shared runtime/build/ is never touched. A build failure raises unittest.SkipTest (clean skip) and leaks at most a /tmp dir, never a broken source tree. tearDownClass just removes the temp dir.
2. Tighten _probe_gpu to also require cuda.h (probed the way the Makefile looks for it, -I$CUDA_HOME/include), so @requires_gpu is False on a driver-only box and the test skips at collection rather than being selected and then failing the shim build.

Verified: header probe returns False for an empty CUDA_HOME, True when cuda.h is planted; full suite 848 passed, 1 skipped.
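A rough sketch of the two fixes, assuming a plain file-existence check for the header and a caller-supplied build command. The helper names `has_cuda_header` and `build_isolated` are hypothetical and are not the test suite's actual functions; the real probe may compile against the header rather than stat it.

```python
import os
import shutil
import subprocess
import tempfile
import unittest


def has_cuda_header(cuda_home):
    """Return True only if cuda.h exists under <cuda_home>/include."""
    if not cuda_home:
        return False
    return os.path.isfile(os.path.join(cuda_home, "include", "cuda.h"))


def build_isolated(runtime_src, build_cmd):
    """Copy runtime_src into a fresh temp dir and run build_cmd there.

    Returns the temp dir on success. On failure, raises unittest.SkipTest,
    so the shared source tree is never left half-built.
    """
    tmp = tempfile.mkdtemp(prefix="pasrt-")
    work = os.path.join(tmp, "runtime")
    shutil.copytree(runtime_src, work)
    result = subprocess.run(build_cmd, cwd=work, capture_output=True, text=True)
    if result.returncode != 0:
        # This sketch removes the copy on failure; only the temp dir is touched.
        shutil.rmtree(tmp, ignore_errors=True)
        raise unittest.SkipTest("CUDA shim build failed: " + result.stderr[-200:])
    return tmp
```

In a test class, `setUpClass` would keep the returned path and `tearDownClass` would remove it with `shutil.rmtree`; because `SkipTest` is raised before anything is stored, a failed build needs no cleanup there.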

The CPU TLS work (commit 7713a86) regressed this test: it compiled the device unit to an x86 dev.ll (for the legacy launch-thunk host path), and that dev.ll now references __pas_tid_x etc. -- TLS globals defined only in cpu_device_shim.c, not the CUDA shim -- so the GPU link failed with 'undefined reference to __pas_tid_x'.

Migrate the test off the legacy --embed-device-ptx path onto the decoupled cuda backend: compile the host with device_backend='cuda' (emits no launch thunk and no kernel-symbol reference, so no dev.ll is linked), objectify the PTX into a NUL-terminated __pas_device_ptx blob, and link host.ll + blob.o + cuda shim + -lcuda. This is the same 3-command flow the cleanup work established for the examples.

Verified (no GPU needed): host .ll has external __pas_device_ptx, no __pas_klaunch, no kernel def; ld -r host.o + blob.o links clean -- no undefined TLS symbol. The actual CUDA shim build + run remain @requires_gpu. Full suite 848 passed, 1 skipped.
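One way to picture the "objectify the PTX" step is to generate a small C source file that defines __pas_device_ptx as a byte array with a trailing NUL, then compile that file to blob.o with any C compiler. This is an illustrative assumption, not necessarily how the repository builds its blob object; the function name `ptx_blob_c_source` is hypothetical.

```python
def ptx_blob_c_source(ptx_text, symbol="__pas_device_ptx"):
    """Return C source defining `symbol` as the PTX bytes plus a trailing NUL."""
    data = ptx_text.encode("utf-8") + b"\0"  # NUL terminator for the loader
    rows = []
    for start in range(0, len(data), 12):
        chunk = data[start:start + 12]
        rows.append("    " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(rows)
    # At file scope in C, a const array like this has external linkage,
    # so a host object that declares the symbol extern can link against it.
    return f"const unsigned char {symbol}[] = {{\n{body}\n}};\n"


example = ptx_blob_c_source(".version 7.0\n")
```

The terminator matters because a loader that takes the PTX image as a C string needs to know where the text ends.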

The testing section omitted that link-requiring tests link the hardcoded runtime/build/libpascalrt.a and FAIL (not skip) without it. State the prerequisite up front so a clean-tree run isn't a surprise.

Dixie Flatline a/k/a McCoy Pauley added 10 commits June 27, 2026 03:45

Add plan to collapse GPU device-build pipeline to three commands

571c9bb

Record as-built status in device-build-cleanup plan

00073fa

Update RUNNING_PTX.md to the unified --target ptx CLI

ef1acd3

Move completed plan docs to docs/old/

ab275ee

README: state the 'make -C runtime' prerequisite for tests

bddfc95

jamesmcclain changed the title ~~Improve PTX build process, add full(er) support for device code on CPU~~ Improve PTX compilation, add full(er) support for device code on CPU Jun 27, 2026

jamesmcclain merged commit c0d29f4 into master Jun 27, 2026

jamesmcclain deleted the device-cleanup branch June 27, 2026 19:27

jamesmcclain merged 10 commits into master from device-cleanup
