
Improve PTX compilation, add full(er) support for device code on CPU #36

Merged
jamesmcclain merged 10 commits into master from device-cleanup on Jun 27, 2026

@jamesmcclain

Owner

No description provided.

Dixie Flatline a/k/a McCoy Pauley added 10 commits June 27, 2026 03:45
- Add --device-backend cuda: host emits no launch thunk/registry and no
  kernel-symbol reference, eliminating the dead second device compile (dev.ll).
- Reference the embedded PTX as an external __pas_device_ptx symbol on the cuda
  backend; package PTX text as its own NUL-terminated blob object at link time.
- Unify the PTX CLI into 'pascal1981 --target ptx' (--sm/--emit-llvm); keep
  compile_to_ptx as a deprecated alias.
- Prebuild both runtime archives (libpascalrt_cpu.a / _cuda.a) once; drop the
  clean-rebuild-on-switch dance.
- Update device-example.mk, build-cuda-host.sh, READMEs, and the plan doc;
  add cuda-backend decoupling regression tests.

THREADIDX_*/BLOCKIDX_*/BLOCKDIM_*/GRIDDIM_* on the CPU triple now lower
to loads from _Thread_local globals (__pas_tid_x, __pas_ctaid_x, etc.)
instead of baked-in constants. pas_dev_launch loops over the full
gx*gy*gz x bx*by*bz geometry, setting those registers before each thunk
call -- the same semantics a GPU provides via hardware special registers.
BLOCKDIM_*/GRIDDIM_* default to 1 so direct (non-LAUNCH) calls retain
the old single-thread behaviour.
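
The launch-geometry emulation described above can be modelled in a few lines of Python. This is only a sketch of the semantics, not the shim itself: the real code is C using _Thread_local globals, and the names `IndexRegisters`, `REGS`, `launch`, the `ntid`/`nctaid` field names (borrowed from PTX register naming), and the loop nesting order are all assumptions made for illustration.

```python
import threading
from itertools import product


class IndexRegisters(threading.local):
    """Per-thread stand-ins for the 12 index registers (4 registers x 3 axes)."""

    def __init__(self):
        self.tid = (0, 0, 0)      # THREADIDX_* (assumed to map to __pas_tid_*)
        self.ctaid = (0, 0, 0)    # BLOCKIDX_*  (assumed to map to __pas_ctaid_*)
        self.ntid = (1, 1, 1)     # BLOCKDIM_*, defaults to 1
        self.nctaid = (1, 1, 1)   # GRIDDIM_*, defaults to 1


REGS = IndexRegisters()  # hypothetical name


def launch(kernel, grid, block, *args):
    """Call kernel once per (block, thread) pair, setting the registers first."""
    saved = (REGS.tid, REGS.ctaid, REGS.ntid, REGS.nctaid)
    gx, gy, gz = grid
    bx, by, bz = block
    REGS.nctaid, REGS.ntid = tuple(grid), tuple(block)
    try:
        # Loop order is an assumption; the PR only says the full geometry is covered.
        for cz, cy, cx in product(range(gz), range(gy), range(gx)):
            for tz, ty, tx in product(range(bz), range(by), range(bx)):
                REGS.ctaid = (cx, cy, cz)
                REGS.tid = (tx, ty, tz)
                kernel(*args)
    finally:
        # Restore the previous values so later direct calls see the defaults again.
        REGS.tid, REGS.ctaid, REGS.ntid, REGS.nctaid = saved


def fill_indices(out):
    """Toy kernel: each emulated thread writes its global x index."""
    i = REGS.ctaid[0] * REGS.ntid[0] + REGS.tid[0]
    out[i] = i


out = [-1] * 256
launch(fill_indices, (4, 1, 1), (64, 1, 1), out)  # 4 blocks x 64 threads
```

Calling `fill_indices` directly, outside `launch`, sees dimensions of 1 and indices of 0, which mirrors the single-thread behaviour that direct (non-LAUNCH) calls are said to retain.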

- codegen/exprs.py: CPU-triple builtins emit TLS loads, not constants
- runtime/cpu_device_shim.c: define 12 TLS vars; loop in pas_dev_launch
- examples/device_ptx/device-example.mk: wire DEVICE=cpu build+link
- tests: update index-intrinsic test; add shim to mandelbrot_x86 link
- CPU_DEVICE_TODO.md: marked done

Verified: fill_indices OK all 256, mandelbrot full image -- no kernel
changes, PTX output unchanged.

- device-build-cleanup-plan.md: fully implemented (commit 47ba728)
- CPU_DEVICE_TODO.md: CPU device now emulates full GPU launch geometry (commit 7713a86)

The two device_ptx example READMEs and the CUDA prescription doc still
described DEVICE=cpu as 'not yet wired' / a single-thread grid. The CPU
shim now emulates the full launch geometry, so update both READMEs with
CPU + CUDA build/run instructions and prerequisites, fix the
device-example.mk header comment, and update the now-stale
'single-thread grid' language in cuda-kernel-prescription.md and the
codegen docstrings to match the TLS-index-register emulation.

…quires_gpu

The GPU orchestration test's _build_cuda_runtime ran 'make -C runtime clean'
against the shared source-tree runtime/build/ (which every other link test
links as libpascalrt.a), then rebuilt only the cuda archive. If the cuda
build failed -- e.g. a driver-only box (nvidia-smi + libcuda.so.1 but no
CUDA toolkit headers) -- setUpClass raised, tearDownClass never ran, and
runtime/build/ was left empty, cascading link failures into every other
exe-requiring test (the trailing gcc-install-dir-libstdcxx warning in each
truncated summary hid the real 'no such file: libpascalrt.a' cause).

Two fixes:

1. Build the CUDA shim into an ISOLATED temp copy of the runtime sources so
   the shared runtime/build/ is never touched. A build failure raises
   unittest.SkipTest (clean skip) and leaks at most a /tmp dir, never a
   broken source tree. tearDownClass just removes the temp dir.

2. Tighten _probe_gpu to also require cuda.h (probed the way the Makefile
   looks for it, -I$CUDA_HOME/include), so @requires_gpu is False on a
   driver-only box and the test skips at collection rather than being
   selected and then failing the shim build.
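
The first fix (build in an isolated copy, skip on failure) might look roughly like the helper below. It is a sketch under assumptions: `build_isolated` and its parameters are hypothetical names, and the build command is passed in by the caller because the PR does not show the actual make invocation.

```python
import shutil
import subprocess
import tempfile
import unittest
from pathlib import Path


def build_isolated(runtime_src, build_cmd):
    """Copy runtime_src into a fresh temp dir, run build_cmd there, return the temp dir.

    Any build failure becomes unittest.SkipTest; the shared source tree is only
    read, never written.
    """
    tmp = Path(tempfile.mkdtemp(prefix="pas-cuda-rt-"))
    work = tmp / "runtime"
    shutil.copytree(runtime_src, work)
    try:
        proc = subprocess.run(build_cmd, cwd=work, capture_output=True, text=True)
    except OSError as exc:
        # Build tool missing: skip; at worst the temp dir is leaked.
        raise unittest.SkipTest(f"cannot run build command: {exc}")
    if proc.returncode != 0:
        # Toolkit headers missing, compiler error, etc.: skip rather than fail.
        raise unittest.SkipTest(f"CUDA runtime build failed:\n{proc.stderr}")
    return tmp
```

A test class would call this from setUpClass, keep the returned path, and have tearDownClass remove it with `shutil.rmtree(path, ignore_errors=True)`. Because unittest treats a SkipTest raised in setUpClass as a class-level skip, a failed build no longer needs tearDownClass to run in order to leave the source tree intact.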

Verified: header probe returns False for an empty CUDA_HOME, True when
cuda.h is planted; full suite 848 passed, 1 skipped.
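
The header probe can be approximated with a plain file check, as sketched here. The function names are hypothetical, and the real _probe_gpu may probe through the compiler instead; this version only checks for the file under the include directory the commit message names.

```python
import os


def has_cuda_header(cuda_home):
    """Return True only if cuda.h exists under <cuda_home>/include."""
    if not cuda_home:
        return False
    return os.path.isfile(os.path.join(cuda_home, "include", "cuda.h"))


def probe_cuda_header(env=None):
    """Read CUDA_HOME from env (defaults to os.environ) and probe for cuda.h."""
    env = os.environ if env is None else env
    return has_cuda_header(env.get("CUDA_HOME", ""))
```

A `requires_gpu`-style decorator could then combine this result with the existing driver checks, so a driver-only machine is skipped before any shim build is attempted.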

The CPU TLS work (commit 7713a86) regressed this test. The test compiled the
device unit to an x86 dev.ll (for the legacy launch-thunk host path), and
that dev.ll now references __pas_tid_x etc. -- TLS globals defined only in
cpu_device_shim.c, not the CUDA shim -- so the GPU link failed with
'undefined reference to __pas_tid_x'.

Migrate the test off the legacy --embed-device-ptx path onto the decoupled
cuda backend: compile the host with device_backend='cuda' (emits no launch
thunk and no kernel-symbol reference, so no dev.ll is linked), objectify the
PTX into a NUL-terminated __pas_device_ptx blob, and link host.ll + blob.o
+ cuda shim + -lcuda. This is the same 3-command flow the cleanup work
established for the examples.
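
The "objectify the PTX" step is not spelled out in the commit message. One way it could be done is sketched below, assuming a GNU-as-style stub that pulls the bytes in with `.incbin`; the helper names and the stub layout are illustrative guesses, not the project's actual tooling.

```python
def ptx_blob_bytes(ptx_text):
    """Encode PTX text and append the single terminating NUL byte."""
    data = ptx_text.encode("utf-8")
    if b"\0" in data:
        raise ValueError("PTX text must not contain embedded NUL bytes")
    return data + b"\0"


def blob_asm_stub(blob_path, symbol="__pas_device_ptx"):
    """Return GNU-as source exporting the blob file's bytes under `symbol`."""
    return "\n".join([
        "    .section .rodata",
        f"    .globl {symbol}",
        f"{symbol}:",
        f'    .incbin "{blob_path}"',
        "",
    ])
```

Assembling the stub would yield an object that defines the external __pas_device_ptx symbol the host references. The trailing NUL matters because CUDA driver entry points such as cuModuleLoadData accept PTX as a NUL-terminated string.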

Verified (no GPU needed): host .ll has external __pas_device_ptx, no
__pas_klaunch, no kernel def; ld -r host.o + blob.o links clean -- no
undefined TLS symbol. The actual CUDA shim build + run remain @requires_gpu.
Full suite 848 passed, 1 skipped.

The testing section omitted that link-requiring tests link against the hardcoded
runtime/build/libpascalrt.a and FAIL (not skip) without it. State the
prerequisite up front so a clean-tree run isn't a surprise.

@jamesmcclain changed the title from "Improve PTX build process, add full(er) support for device code on CPU" to "Improve PTX compilation, add full(er) support for device code on CPU" on Jun 27, 2026
@jamesmcclain merged commit c0d29f4 into master on Jun 27, 2026
@jamesmcclain deleted the device-cleanup branch on June 27, 2026 19:27