From 571c9bb965879c7b18bd2d04624ad3a26761c5d6 Mon Sep 17 00:00:00 2001 From: Dixie Flatline a/k/a McCoy Pauley Date: Sat, 27 Jun 2026 03:45:30 +0000 Subject: [PATCH 01/10] Add plan to collapse GPU device-build pipeline to three commands --- docs/device-build-cleanup-plan.md | 192 ++++++++++++++++++++++++++++++ 1 file changed, 192 insertions(+) create mode 100644 docs/device-build-cleanup-plan.md diff --git a/docs/device-build-cleanup-plan.md b/docs/device-build-cleanup-plan.md new file mode 100644 index 0000000..98dac92 --- /dev/null +++ b/docs/device-build-cleanup-plan.md @@ -0,0 +1,192 @@ +# Plan: collapse the GPU device-build pipeline to three commands + +Status: PROPOSED. No code changed yet — this is the design. + +## 1. Where the bodies are buried (current state) + +The end-to-end GPU path for an example (`examples/device_ptx/mandelbrot`, +`fill_indices`) is driven by `examples/device_ptx/device-example.mk` and the +hand-written `scripts/build-cuda-host.sh`. For `DEVICE=cuda` it does **five** +build actions plus a full runtime rebuild, for two source files: + +``` +1. dev.ptx = python -m pascal1981.compile_to_ptx dev.pas --cpu sm_86 # device -> PTX +2. dev.ll = python -m pascal1981 dev.pas # device -> host-x86 .ll +3. host.ll = python -m pascal1981 --embed-device-ptx dev.ptx host.pas # host -> .ll, PTX baked in +4. runtime = make -C runtime clean && make -C runtime DEVICE_SHIM=cuda # wholesale archive rebuild +5. link = clang host.ll dev.ll libpascalrt.a -L.../stubs -lcuda -o exe +``` + +(`build-cuda-host.sh` has an extra step 3 compiling the interface `.inc` too.) + +### The jank, itemized + +- **J1 — the device unit is compiled twice, for two unrelated reasons.** + Once to NVPTX PTX (the real kernel), once to a *host-x86* `.ll` whose only + job is to define the kernel symbol so the link resolves. The second compile + produces dead code: it never runs on the GPU. 
+ +- **J2 — `dev.ll` exists solely to satisfy a link-time reference from dead + code.** Host codegen emits, for every `LAUNCH`, an internal dispatch thunk + `__pas_klaunch_` that *calls the external kernel symbol* + (`codegen/stmts.py::_kernel_launch_thunk`). That thunk is the CPU-device + stand-in; on the GPU the CUDA shim dispatches the kernel by name out of the + loaded module (`runtime/cuda_launch.c`) and the thunk is never called. But + because the thunk *statically references* `@`, the linker demands a + definition, so we drag in `dev.ll`. The reference is real; the call is dead. + +- **J3 — host `.ll` is coupled to the device artifact via `--embed-device-ptx`.** + The PTX text is baked into `host.ll` as the `__pas_device_ptx` blob at host + compile time (`codegen/stmts.py::_device_ptx_ptr`). So "compile the host" + cannot run before "compile the device," and any PTX change forces a host + recompile. The host source has nothing to do with the kernel text; this is a + packaging concern leaking into the compiler front end. + +- **J4 — two CLIs with divergent flags and defaults.** `pascal1981` and + `pascal1981.compile_to_ptx` duplicate `--device-triple`, `-f`, `--dialect`, + and disagree on defaults (`--cpu sm_70` vs none; device-triple host vs NVPTX). + The PTX driver re-implements parse/check/lower glue. + +- **J5 — the runtime archive is rebuilt from clean on every GPU build.** The cpu + and cuda shims define the same `pas_dev_*` symbols and cannot coexist in one + archive, so the Makefile's `runtime-cuda` target does `make clean && make + DEVICE_SHIM=cuda` every time. There is no prebuilt-runtime story. + +## 2. Target workflow (the goal) + +Runtime is prebuilt **once**. Then, per example, exactly three commands: + +```bash +# 1. one command against the device file -> .ptx (+ optional .ll, + embeddable object) +pascal1981 --target ptx mandelbrot.pas mandelbrot.ptx --sm sm_86 -f wide-integers + +# 2. 
one command against the host file -> .ll (no PTX coupling) +pascal1981 --target host --device-backend cuda mandelbrot_host.pas mandelbrot_host.ll -f wide-integers + +# 3. one clang command to link the host +clang mandelbrot_host.ll mandelbrot.ptx.o libpascalrt_cuda.a -L$CUDA/lib64/stubs -lcuda -o mandelbrot_host +``` + +`ptxas`/`cubin` stays optional (a stronger check, or an `.o` route — see §3.3). +No second device compile. No `dev.ll`. No runtime rebuild. The host `.ll` is +independent of the kernel text. + +## 3. The changes + +### 3.1 Kill `dev.ll` by gating the CPU stand-in machinery (fixes J1, J2) + +Root cause is the thunk's static reference to the kernel symbol. Add a host +compile knob `--device-backend {cpu,cuda}` (plumbed into the codegen +constructor, `codegen/base.py`). Then in `_codegen_device_orchestration` / +`_emit_launch_registry`: + +- **backend=cuda:** do **not** emit the `__pas_klaunch_` thunk or the + `__pas_klaunch_registry` table. The GPU launch path only needs + `pas_dev_module_load(registry=NULL, ptx)` → `pas_dev_module_get_function(mod, + name)` → `pas_dev_launch(entry, geom, argv)`. Pass a null registry pointer; + the cuda shim already ignores it (`runtime/cuda_launch.c::pas_dev_module_load` + casts `registry` to `(void)`). With no thunk, there is **no reference to the + kernel symbol in host `.ll`**, so the link needs no `dev.ll`. + +- **backend=cpu:** unchanged — emit thunk + registry exactly as today. The CPU + device still resolves and calls the thunk. + +Fallback if we want to keep the thunk for symmetry: emit the kernel extern as +`extern_weak` so an undefined symbol resolves to null instead of forcing a +definition. Preferred is to drop it entirely on the GPU path — less dead IR. + +Net: the GPU build compiles the device unit **once** (to PTX) and never produces +or links `dev.ll`. + +### 3.2 Decouple PTX embedding from host compile (fixes J3) + +Stop baking PTX into `host.ll`. 
Instead, the host references an *external* +`__pas_device_ptx` symbol, and the PTX blob becomes its own object linked at +step 3. Two ways to produce that object from `mandelbrot.ptx`; pick one: + +- **(a) emit it from the device command.** `--target ptx` also writes + `mandelbrot.ptx.o` (or `.s`) defining `const char __pas_device_ptx[]` via an + `.incbin`-style stub or `llvm-mc`. Keeps "one command against the device file" + literally true and the link a single `clang ... mandelbrot.ptx.o ...`. + +- **(b) objectify at link time** with a documented one-liner + (`ld -r -b binary`, or a 3-line `.s` using `.incbin "mandelbrot.ptx"`). The + clang link line gains one input; the host `.ll` stays pure. + +Either way `codegen/stmts.py::_device_ptx_ptr` changes from "embed the text" to +"declare `external global` `__pas_device_ptx`," and `--embed-device-ptx` +becomes optional/legacy. Host compile no longer depends on the device artifact. + +If we would rather not add a link input, the legacy `--embed-device-ptx` path +can stay as an opt-in for a strictly two-input link — but the default clean path +should decouple. + +### 3.3 Optional `ptxas` / cubin route + +For users who want the assembled artifact: `--target ptx` can additionally drive +`ptxas -arch=$SM -o mandelbrot.cubin mandelbrot.ptx` when the toolkit is +present, and §3.2's object can embed the cubin instead of PTX (the cuda shim +then `cuModuleLoadData`s a cubin, which it already accepts). This is a strict +add-on; the PTX-text path remains the no-GPU-needed default. + +### 3.4 Fold the two CLIs into one (fixes J4) + +Make `--target {host,ptx}` a flag on the single `pascal1981` driver +(`compile_to_llvm.py::main`), sharing feature resolution, dialect, and check +flags. `--target ptx` sets the device triple to `nvptx64-nvidia-cuda`, honors +`--sm` (alias the old `--cpu`), and routes through the existing +`compile_to_ptx.llvm_ir_to_ptx`. 
Keep `python -m pascal1981.compile_to_ptx` as a +thin shim that forwards to `--target ptx` for back-compat and existing tests +(`tests/integration/test_device_mandelbrot_ptx.py`, +`fill_indices/RUNNING_PTX.md`). + +### 3.5 Prebuild both runtime archives once (fixes J5) + +Split the shim out of the single archive so neither dominates: + +- Build a **core** archive `libpascalrt.a` (everything except the two + `*_device_shim` / `cuda_launch` shims), plus two tiny shim archives + `libpascalrt_dev_cpu.a` and `libpascalrt_dev_cuda.a`. Consumers link core + + the chosen shim. No symbol clash, no rebuild. + + Or, simpler for callers: produce two full archives `libpascalrt_cpu.a` and + `libpascalrt_cuda.a` in one `make` invocation (two `ar` outputs from one core + object set + one shim each). Either removes the `runtime-cuda` clean-rebuild. + +The example Makefile then just picks the archive; `runtime-cuda` (the phony that +does `make clean && make DEVICE_SHIM=cuda`) is deleted. + +## 4. Resulting build files + +- `device-example.mk` drops the `dev.ll` rule, the `runtime-cuda` phony, and the + `--embed-device-ptx` on the host rule. The `cuda` branch becomes: + ```make + $(BUILD)/dev.ptx: $(DEVICE_UNIT) ; $(PAS) --target ptx $< $@ --sm $(SM) $(FEATURES) + $(BUILD)/dev.o: $(BUILD)/dev.ptx ; + $(BUILD)/host.ll: $(HOST_SRC) ; $(PAS) --target host --device-backend cuda $(FEATURES) $< $@ + $(EXE): $(BUILD)/host.ll $(BUILD)/dev.o ; clang $^ $(RUNTIME_CUDA) -L$(CUDA_HOME)/lib64/stubs -lcuda -o $@ + ``` +- `scripts/build-cuda-host.sh` collapses from 6 steps to 3 (+ optional ptxas), + and stops rebuilding the runtime. + +## 5. Migration / compatibility + +- Keep `compile_to_ptx` and `--embed-device-ptx` working (deprecated aliases) so + existing tests and the `RUNNING_PTX.md` external-launcher recipe keep passing. +- CPU-device path is untouched by design (backend=cpu keeps thunk+registry); the + deferred grid-stride work in `CPU_DEVICE_TODO.md` is orthogonal. 
+- The PTX ABI is unchanged — same `.visible .entry`, same parameters — so the + drop-in property the mandelbrot README sells (matching `mandelbrot.cu` + symbol-for-symbol) is preserved. The validation ladder in + `RUNNING_PTX.md`/`cuda-kernel-prescription.md` still applies rung for rung. + +## 6. Validation + +- Existing PTX-text + `ptxas` checks (mandelbrot/fill READMEs) must still pass on + the new `--target ptx` output, byte-comparable to the old `compile_to_ptx`. +- A new check: host `.ll` built with `--device-backend cuda` has **no undefined + kernel symbol** and **no `__pas_klaunch_` thunk** (`grep`-able), proving J1/J2 + are gone. +- Link the three-command path on a GPU box and run the existing host programs; + output (ASCII mandelbrot, `OK: all 256 indices correct`) must be unchanged. +``` From 47ba728f614d64627ec13a46c2f2de287a3ea6da Mon Sep 17 00:00:00 2001 From: Dixie Flatline a/k/a McCoy Pauley Date: Sat, 27 Jun 2026 03:58:13 +0000 Subject: [PATCH 02/10] Collapse GPU device build to three commands - Add --device-backend cuda: host emits no launch thunk/registry and no kernel-symbol reference, eliminating the dead second device compile (dev.ll). - Reference the embedded PTX as an external __pas_device_ptx symbol on the cuda backend; package PTX text as its own NUL-terminated blob object at link time. - Unify the PTX CLI into 'pascal1981 --target ptx' (--sm/--emit-llvm); keep compile_to_ptx as a deprecated alias. - Prebuild both runtime archives (libpascalrt_cpu.a / _cuda.a) once; drop the clean-rebuild-on-switch dance. - Update device-example.mk, build-cuda-host.sh, READMEs, and the plan doc; add cuda-backend decoupling regression tests. 
--- docs/device-build-cleanup-plan.md | 51 +++++++++++----- examples/device_ptx/device-example.mk | 40 +++++++------ examples/device_ptx/fill_indices/README.md | 8 ++- examples/device_ptx/mandelbrot/README.md | 20 +++++-- runtime/Makefile | 69 ++++++++++++++-------- scripts/build-cuda-host.sh | 66 ++++++++++----------- src/pascal1981/codegen/__init__.py | 10 +++- src/pascal1981/codegen/base.py | 10 +++- src/pascal1981/codegen/stmts.py | 33 ++++++++++- src/pascal1981/compile_to_llvm.py | 61 ++++++++++++++++++- tests/test_device_ptx_module.py | 49 ++++++++++++++- 11 files changed, 309 insertions(+), 108 deletions(-) mode change 100644 => 100755 scripts/build-cuda-host.sh diff --git a/docs/device-build-cleanup-plan.md b/docs/device-build-cleanup-plan.md index 98dac92..84e1c51 100644 --- a/docs/device-build-cleanup-plan.md +++ b/docs/device-build-cleanup-plan.md @@ -63,8 +63,8 @@ pascal1981 --target ptx mandelbrot.pas mandelbrot.ptx --sm sm_86 -f wide-int # 2. one command against the host file -> .ll (no PTX coupling) pascal1981 --target host --device-backend cuda mandelbrot_host.pas mandelbrot_host.ll -f wide-integers -# 3. one clang command to link the host -clang mandelbrot_host.ll mandelbrot.ptx.o libpascalrt_cuda.a -L$CUDA/lib64/stubs -lcuda -o mandelbrot_host +# 3. one clang command to link the host (after objectifying the PTX blob) +clang mandelbrot_host.ll mandelbrot_ptx_blob.o libpascalrt_cuda.a -L$CUDA/lib64/stubs -lcuda -o mandelbrot_host ``` `ptxas`/`cubin` stays optional (a stronger check, or an `.o` route — see §3.3). @@ -101,21 +101,42 @@ or links `dev.ll`. ### 3.2 Decouple PTX embedding from host compile (fixes J3) Stop baking PTX into `host.ll`. Instead, the host references an *external* -`__pas_device_ptx` symbol, and the PTX blob becomes its own object linked at -step 3. 
Two ways to produce that object from `mandelbrot.ptx`; pick one: - -- **(a) emit it from the device command.** `--target ptx` also writes - `mandelbrot.ptx.o` (or `.s`) defining `const char __pas_device_ptx[]` via an - `.incbin`-style stub or `llvm-mc`. Keeps "one command against the device file" - literally true and the link a single `clang ... mandelbrot.ptx.o ...`. +`__pas_device_ptx` symbol (`codegen/stmts.py::_device_ptx_ptr` now declares +`@__pas_device_ptx = external constant [0 x i8]` on the cuda backend), and the +PTX blob becomes its own object linked at step 3. + +**What that object is — and is NOT.** It is an object file defining ONE data +symbol, `__pas_device_ptx`, holding the PTX **text bytes, NUL-terminated**, +because the CUDA shim reads it as a `const char *` C-string +(`runtime/cuda_launch.c` checks `ptx[0]=='\0'` then `cuModuleLoadData`s it). It +is **not** `ptxas`/cubin output. Name it for what it is — +`mandelbrot_ptx_blob.o` — **never `.ptx.o`**, which invites feeding it to the +wrong tool. Two correctness traps the naming hid: + +1. **NUL termination.** A bare `.incbin "mandelbrot.ptx"` is *not* + NUL-terminated; the stub must append a `.byte 0` or the shim reads past the + blob. +2. The object carries no code, just `.rodata`; it is produced by the assembler, + not a compiler pass. + +The objectifier is a 4-line assembly stub assembled with `clang -c`: + +```asm + .section .rodata + .globl __pas_device_ptx +__pas_device_ptx: + .incbin "mandelbrot.ptx" + .byte 0 # the C-string NUL the shim requires +``` -- **(b) objectify at link time** with a documented one-liner - (`ld -r -b binary`, or a 3-line `.s` using `.incbin "mandelbrot.ptx"`). The - clang link line gains one input; the host `.ll` stays pure. +The example Makefile / `build-cuda-host.sh` generate this stub from `dev.ptx`. +`--embed-device-ptx` stays as a legacy opt-in (host-embeds, two-input link). 
+With the default decoupled path, host compile no longer depends on the device +artifact. -Either way `codegen/stmts.py::_device_ptx_ptr` changes from "embed the text" to -"declare `external global` `__pas_device_ptx`," and `--embed-device-ptx` -becomes optional/legacy. Host compile no longer depends on the device artifact. +**Verified:** `host.o` built with `--device-backend cuda` shows `U +__pas_device_ptx` and no `__pas_klaunch_*` / kernel symbol; `ld -r host.o +mandelbrot_ptx_blob.o` resolves it to a defined `R __pas_device_ptx`. If we would rather not add a link input, the legacy `--embed-device-ptx` path can stay as an opt-in for a strictly two-input link — but the default clean path diff --git a/examples/device_ptx/device-example.mk b/examples/device_ptx/device-example.mk index 7d79589..2e9901d 100644 --- a/examples/device_ptx/device-example.mk +++ b/examples/device_ptx/device-example.mk @@ -21,6 +21,7 @@ THIS_MK := $(lastword $(MAKEFILE_LIST)) REPO := $(abspath $(dir $(THIS_MK))/../..) RUNTIME := $(REPO)/runtime RUNTIME_LIB := $(RUNTIME)/build/libpascalrt.a +RUNTIME_CUDA:= $(RUNTIME)/build/libpascalrt_cuda.a PAS := PYTHONPATH=$(REPO)/src python3 -m pascal1981 PTX := PYTHONPATH=$(REPO)/src python3 -m pascal1981.compile_to_ptx @@ -40,29 +41,32 @@ $(BUILD): mkdir -p $(BUILD) ifeq ($(DEVICE),cuda) -# ---- real GPU: CUDA Driver API shim + embedded PTX (Strategy 1) ------------- -# The device kernel is compiled twice, on purpose: -# * to PTX, embedded into the host so the CUDA shim cuModuleLoadData's it; -# * to a host .ll, which defines the kernel symbol the host's launch thunk -# links against (dead at run time -- the real kernel is the loaded PTX). +# ---- real GPU: CUDA Driver API shim, three commands ------------------------ +# The device kernel is compiled ONCE, to PTX (the real kernel). The host is +# compiled with --device-backend cuda, so it emits no in-process launch thunk +# and no kernel-symbol reference -- there is no second 'dev.ll' device compile. 
+# The PTX text is packaged as its own object (a NUL-terminated __pas_device_ptx +# byte blob the host references as an external symbol); the CUDA shim +# cuModuleLoadData's it at run time. Build the cuda runtime archive once with +# make -C runtime cuda +# (this Makefile does not rebuild it on every example build). $(BUILD)/dev.ptx: $(DEVICE_UNIT) | $(BUILD) - $(PTX) $< $@ --cpu $(SM) $(FEATURES) + $(PAS) --target ptx $< $@ --sm $(SM) $(FEATURES) -$(BUILD)/dev.ll: $(DEVICE_UNIT) | $(BUILD) - $(PAS) $(FEATURES) $< $@ +# Objectify the PTX into a single data symbol the host links against. This is a +# data blob (PTX *text* + a trailing NUL for the shim's C-string read), NOT +# ptxas/cubin output -- hence the _blob.o name, never .ptx.o. +$(BUILD)/dev_ptx_blob.s: $(BUILD)/dev.ptx | $(BUILD) + printf '\t.section .rodata\n\t.globl __pas_device_ptx\n__pas_device_ptx:\n\t.incbin "$(BUILD)/dev.ptx"\n\t.byte 0\n' > $@ -$(BUILD)/host.ll: $(HOST_SRC) $(BUILD)/dev.ptx | $(BUILD) - $(PAS) $(FEATURES) --embed-device-ptx $(BUILD)/dev.ptx $< $@ +$(BUILD)/dev_ptx_blob.o: $(BUILD)/dev_ptx_blob.s + clang -c $< -o $@ -# The runtime archive must carry the CUDA shim (cuda_launch.c). The cpu and cuda -# shims define the same symbols, so the archive is rebuilt cleanly for this mode. 
-.PHONY: runtime-cuda -runtime-cuda: - $(MAKE) -C $(RUNTIME) clean - $(MAKE) -C $(RUNTIME) DEVICE_SHIM=cuda +$(BUILD)/host.ll: $(HOST_SRC) | $(BUILD) + $(PAS) $(FEATURES) --device-backend cuda $< $@ -$(EXE): $(BUILD)/host.ll $(BUILD)/dev.ll runtime-cuda - clang $(BUILD)/host.ll $(BUILD)/dev.ll $(RUNTIME_LIB) \ +$(EXE): $(BUILD)/host.ll $(BUILD)/dev_ptx_blob.o + clang $(BUILD)/host.ll $(BUILD)/dev_ptx_blob.o $(RUNTIME_CUDA) \ -L$(CUDA_HOME)/lib64/stubs -lcuda -o $@ else ifeq ($(DEVICE),cpu) diff --git a/examples/device_ptx/fill_indices/README.md b/examples/device_ptx/fill_indices/README.md index 5695f14..e4dca92 100644 --- a/examples/device_ptx/fill_indices/README.md +++ b/examples/device_ptx/fill_indices/README.md @@ -106,14 +106,16 @@ From the repository root: ```bash cd examples/device_ptx/fill_indices -PYTHONPATH=../../../src python3 -m pascal1981.compile_to_ptx \ +PYTHONPATH=../../../src python3 -m pascal1981 --target ptx \ fill.pas \ fill.ptx \ --emit-llvm fill.ll \ - --cpu sm_70 + --sm sm_70 ``` -Outputs: +`--target ptx` on the single `pascal1981` driver replaces the old +`python -m pascal1981.compile_to_ptx` (still accepted as a deprecated alias; +`--sm` replaces `--cpu`). Outputs: ```text fill.ll # intermediate LLVM IR diff --git a/examples/device_ptx/mandelbrot/README.md b/examples/device_ptx/mandelbrot/README.md index 8ade651..e7b3f54 100644 --- a/examples/device_ptx/mandelbrot/README.md +++ b/examples/device_ptx/mandelbrot/README.md @@ -49,6 +49,14 @@ leaf runtime shim is C. The kernels are unchanged, so the emitted PTX remains th drop-in described next. Build rules live in [`../device-example.mk`](../device-example.mk). 
+The GPU build is now three commands (the runtime archive is prebuilt once with +`make -C runtime cuda`): device unit -> PTX (`--target ptx`); host program -> +`.ll` (`--device-backend cuda`, which emits no launch thunk and no kernel-symbol +reference, so there is **no** second device compile); then one `clang` link of +`host.ll` + the PTX-blob object + `libpascalrt_cuda.a` `-lcuda`. The PTX text is +packaged as its own NUL-terminated `__pas_device_ptx` data object (a `*_blob.o`, +**not** `ptxas`/cubin output) that the host references as an external symbol. + ## The ABI being matched From `mandelbrot.cu`: @@ -71,15 +79,17 @@ parameters are genuinely 32-bit; `mandelbrot_f64` uses `REAL64` (≡ `REAL`, f64 ## Build the PTX ```bash -PYTHONPATH=src python3 -m pascal1981.compile_to_ptx \ +PYTHONPATH=src python3 -m pascal1981 --target ptx \ examples/device_ptx/mandelbrot/mandelbrot.pas \ examples/device_ptx/mandelbrot/mandelbrot.ptx \ - --emit-llvm examples/device_ptx/mandelbrot/mandelbrot.ll \ - --cpu sm_86 + --sm sm_86 -f wide-integers ``` -This needs `llvmlite`/LLVM with the NVPTX backend; it needs **no** NVIDIA device, -CUDA driver/runtime, `nvcc`, or the Pascal runtime library. +`--target ptx` on the single `pascal1981` driver replaces the old +`python -m pascal1981.compile_to_ptx` (still accepted as a deprecated alias; +`--sm` replaces `--cpu`). It needs `llvmlite`/LLVM with the NVPTX backend; it +needs **no** NVIDIA device, CUDA driver/runtime, `nvcc`, or the Pascal runtime +library. ## Inspect the artifact diff --git a/runtime/Makefile b/runtime/Makefile index 607bafe..0da10c4 100644 --- a/runtime/Makefile +++ b/runtime/Makefile @@ -1,8 +1,23 @@ # Makefile for the Pascal-1981 C runtime static library. # # Usage: -# make Build libpascalrt.a in build/ +# make Build the CPU-shim archive (libpascalrt_cpu.a) + the +# back-compat alias libpascalrt.a, in build/. No GPU/CUDA +# headers required. +# make cuda Also build the CUDA-shim archive (libpascalrt_cuda.a). 
+# Requires the CUDA toolkit headers ($CUDA_HOME/include). +# make both Build both archives. # make clean Remove build artifacts +# +# The cpu and cuda device shims define the SAME pas_dev_* symbols, so they +# cannot coexist in one archive. Rather than rebuild-on-switch (the old +# DEVICE_SHIM clean-rebuild dance), we compile the shared core once and emit one +# archive per shim. Consumers pick the archive at link time; the core objects +# are never recompiled to switch devices. A consumer linking libpascalrt_cuda.a +# must add -lcuda on its final link line. +# +# DEVICE_SHIM is still accepted for backward compatibility: `make +# DEVICE_SHIM=cuda` builds the cuda archive as libpascalrt.a, as before. CC := clang CFLAGS := -c -O2 -Wall -Wextra @@ -11,37 +26,46 @@ ARFLAGS := rcs SRCDIR := . BUILDDIR:= build -TARGET := $(BUILDDIR)/libpascalrt.a +CUDA_HOME ?= /usr/local/cuda + +# Shared core: every .c file except the two device shims (built once, reused by +# both archives). +CORE_SRCS := $(filter-out $(SRCDIR)/cpu_device_shim.c $(SRCDIR)/cuda_launch.c,$(wildcard $(SRCDIR)/*.c)) +CORE_OBJS := $(patsubst $(SRCDIR)/%.c,$(BUILDDIR)/%.o,$(CORE_SRCS)) + +CPU_SHIM_OBJ := $(BUILDDIR)/cpu_device_shim.o +CUDA_SHIM_OBJ := $(BUILDDIR)/cuda_launch.o -# Device-orchestration shim selector: cpu (default, CPU stand-in, no GPU) or -# cuda (real CUDA Driver API). Both shims define the same pas_dev_* symbols, so -# exactly one must be in the archive. The cuda shim needs the CUDA headers to -# compile; consumers linking the cuda archive must add -lcuda on their final -# link line (or use the scripts/build-cuda-host.sh recipe). -DEVICE_SHIM ?= cpu +CPU_LIB := $(BUILDDIR)/libpascalrt_cpu.a +CUDA_LIB := $(BUILDDIR)/libpascalrt_cuda.a +ALIAS_LIB := $(BUILDDIR)/libpascalrt.a +.PHONY: all cuda both clean cleaner + +# ---- Backward-compatible DEVICE_SHIM override ------------------------------ +# `make DEVICE_SHIM=cuda` builds the cuda archive as the legacy libpascalrt.a. 
ifeq ($(DEVICE_SHIM),cuda) -SHIM_EXCLUDE := cpu_device_shim.c -CUDA_HOME ?= /usr/local/cuda -CFLAGS += -I$(CUDA_HOME)/include -else ifeq ($(DEVICE_SHIM),cpu) -SHIM_EXCLUDE := cuda_launch.c +all: $(CUDA_LIB) + cp $(CUDA_LIB) $(ALIAS_LIB) else -$(error DEVICE_SHIM must be 'cpu' or 'cuda', got '$(DEVICE_SHIM)') +# Default: CPU archive + the back-compat alias name (libpascalrt.a == cpu). +all: $(CPU_LIB) + cp $(CPU_LIB) $(ALIAS_LIB) endif -# Every .c file in this directory is part of the runtime, except the shim that -# was not selected. -SRCS := $(filter-out $(SRCDIR)/$(SHIM_EXCLUDE),$(wildcard $(SRCDIR)/*.c)) -OBJS := $(patsubst $(SRCDIR)/%.c,$(BUILDDIR)/%.o,$(SRCS)) - -.PHONY: all clean +cuda: $(CUDA_LIB) +both: $(CPU_LIB) $(CUDA_LIB) -all: $(TARGET) +$(CPU_LIB): $(CORE_OBJS) $(CPU_SHIM_OBJ) | $(BUILDDIR) + $(AR) $(ARFLAGS) $@ $^ -$(TARGET): $(OBJS) | $(BUILDDIR) +$(CUDA_LIB): $(CORE_OBJS) $(CUDA_SHIM_OBJ) | $(BUILDDIR) $(AR) $(ARFLAGS) $@ $^ +# The cuda shim is the only object that needs the CUDA headers. +$(CUDA_SHIM_OBJ): $(SRCDIR)/cuda_launch.c pascalrt.h | $(BUILDDIR) + $(CC) $(CFLAGS) -I$(CUDA_HOME)/include -o $@ $< + $(BUILDDIR)/%.o: $(SRCDIR)/%.c pascalrt.h | $(BUILDDIR) $(CC) $(CFLAGS) -o $@ $< @@ -52,4 +76,3 @@ clean: rm -rf $(BUILDDIR) cleaner: clean - rm -f $(TARGET) diff --git a/scripts/build-cuda-host.sh b/scripts/build-cuda-host.sh old mode 100644 new mode 100755 index 236717e..4545485 --- a/scripts/build-cuda-host.sh +++ b/scripts/build-cuda-host.sh @@ -2,68 +2,64 @@ # # End-to-end recipe: compile a Pascal DEVICE UNIT + host PROGRAM and run the # kernel on a real GPU through the CUDA Driver API shim (cuda-kernel-prescription -# §5.2 Strategy 1). This is the GPU counterpart of the CPU-device stand-in; the -# Pascal sources are byte-for-byte the same, only the runtime shim differs. +# §5.2 Strategy 1). The Pascal sources are byte-for-byte the same as the CPU +# stand-in; only the runtime shim differs. # -# Pipeline: -# 1. 
device unit -> PTX (--device-triple nvptx64-nvidia-cuda, via compile_to_ptx) -# 2. device unit -> host x86 .ll (defines the kernel symbol the host launch -# thunk references; dead code at run time, -# the real kernel comes from the PTX) -# 3. interface -> .ll -# 4. host program -> .ll, embedding the PTX via --embed-device-ptx -# 5. link main.ll + device .ll + the CUDA runtime archive + -lcuda -# 6. run on the GPU +# Three commands (the runtime archive is prebuilt, not rebuilt here): +# 1. device unit -> PTX (pascal1981 --target ptx) +# 2. host program -> .ll (pascal1981 --device-backend cuda) +# 3. objectify the PTX blob + link (clang) +# +# The host is compiled with --device-backend cuda, so it emits no in-process +# launch thunk and no kernel-symbol reference -- there is no second device +# compile ('dev.ll'). The PTX text is packaged as its own data object, a +# NUL-terminated __pas_device_ptx blob the host references as an external symbol; +# the CUDA shim cuModuleLoadData's it at run time. (This is a data blob, NOT +# ptxas/cubin output -- hence _blob.o, never .ptx.o.) # # Usage: -# scripts/build-cuda-host.sh DEVICE_UNIT.pas IFACE.inc HOST_MAIN.pas OUT_EXE \ +# scripts/build-cuda-host.sh DEVICE_UNIT.pas HOST_MAIN.pas OUT_EXE \ # [-- extra pascal1981 feature flags, e.g. -f wide-integers] # # Requirements: an NVIDIA GPU + driver, llvmlite with the NVPTX backend, clang, -# and the CUDA toolkit headers. Build the runtime archive with the CUDA shim -# first: make -C runtime DEVICE_SHIM=cuda +# and the CUDA toolkit headers. Build the cuda runtime archive once first: +# make -C runtime cuda set -euo pipefail -if [ "$#" -lt 4 ]; then +if [ "$#" -lt 3 ]; then sed -n '2,30p' "$0" exit 2 fi -DEVICE_UNIT=$1; IFACE=$2; HOST_MAIN=$3; OUT_EXE=$4; shift 4 +DEVICE_UNIT=$1; HOST_MAIN=$2; OUT_EXE=$3; shift 3 PAS_FLAGS=() if [ "${1:-}" = "--" ]; then shift; PAS_FLAGS=("$@"); fi REPO_ROOT=$(cd "$(dirname "$0")/.." 
&& pwd) CUDA_HOME=${CUDA_HOME:-/usr/local/cuda} SM=${SM:-sm_89} -RUNTIME_LIB=$REPO_ROOT/runtime/build/libpascalrt.a +RUNTIME_CUDA=$REPO_ROOT/runtime/build/libpascalrt_cuda.a WORK=$(mktemp -d) trap 'rm -rf "$WORK"' EXIT PAS() { PYTHONPATH="$REPO_ROOT/src" python3 -m pascal1981 "$@"; } -PTX() { PYTHONPATH="$REPO_ROOT/src" python3 -m pascal1981.compile_to_ptx "$@"; } -# Ensure the CUDA shim is in the runtime archive (rebuild if missing/stale). -if ! ar t "$RUNTIME_LIB" 2>/dev/null | grep -q '^cuda_launch.o$'; then - echo ">> building runtime with the CUDA shim (DEVICE_SHIM=cuda)" >&2 - make -C "$REPO_ROOT/runtime" clean >/dev/null - make -C "$REPO_ROOT/runtime" DEVICE_SHIM=cuda >/dev/null +if [ ! -f "$RUNTIME_CUDA" ]; then + echo ">> building the cuda runtime archive (make -C runtime cuda)" >&2 + make -C "$REPO_ROOT/runtime" cuda >/dev/null fi echo ">> 1. device unit -> PTX" >&2 -PTX "$DEVICE_UNIT" "$WORK/dev.ptx" --cpu "$SM" "${PAS_FLAGS[@]}" - -echo ">> 2. device unit -> host .ll (defines the kernel symbol)" >&2 -PAS "${PAS_FLAGS[@]}" "$DEVICE_UNIT" "$WORK/dev.ll" >/dev/null - -echo ">> 3. interface -> .ll" >&2 -PAS "${PAS_FLAGS[@]}" "$IFACE" "$WORK/iface.ll" >/dev/null +PAS --target ptx "$DEVICE_UNIT" "$WORK/dev.ptx" --sm "$SM" "${PAS_FLAGS[@]}" >/dev/null -echo ">> 4. host program -> .ll (embedding PTX)" >&2 -PAS "${PAS_FLAGS[@]}" --embed-device-ptx "$WORK/dev.ptx" "$HOST_MAIN" "$WORK/main.ll" >/dev/null +echo ">> 2. host program -> .ll (device-backend cuda)" >&2 +PAS "${PAS_FLAGS[@]}" --device-backend cuda "$HOST_MAIN" "$WORK/host.ll" >/dev/null -echo ">> 5. link host + device .ll + CUDA shim" >&2 -clang "$WORK/main.ll" "$WORK/dev.ll" "$RUNTIME_LIB" \ +echo ">> 3. 
objectify PTX blob + link" >&2 +printf '\t.section .rodata\n\t.globl __pas_device_ptx\n__pas_device_ptx:\n\t.incbin "%s"\n\t.byte 0\n' \ + "$WORK/dev.ptx" > "$WORK/dev_ptx_blob.s" +clang -c "$WORK/dev_ptx_blob.s" -o "$WORK/dev_ptx_blob.o" +clang "$WORK/host.ll" "$WORK/dev_ptx_blob.o" "$RUNTIME_CUDA" \ -L"$CUDA_HOME/lib64/stubs" -lcuda -o "$OUT_EXE" -echo ">> 6. done: $OUT_EXE" >&2 +echo ">> done: $OUT_EXE" >&2 diff --git a/src/pascal1981/codegen/__init__.py b/src/pascal1981/codegen/__init__.py index 1dd9372..c841cfe 100644 --- a/src/pascal1981/codegen/__init__.py +++ b/src/pascal1981/codegen/__init__.py @@ -48,7 +48,8 @@ def __init__(self, host_triple: str = "x86_64-pc-linux-gnu", is_root_compiland: bool = True, is_device_compiland: bool = False, - embed_device_ptx_text: Optional[str] = None): + embed_device_ptx_text: Optional[str] = None, + device_backend: str = 'cpu'): """Initialize Codegen with all mixins.""" super().__init__(verbose=verbose, source_file=source_file, @@ -58,7 +59,8 @@ def __init__(self, host_triple=host_triple, is_root_compiland=is_root_compiland, is_device_compiland=is_device_compiland, - embed_device_ptx_text=embed_device_ptx_text) + embed_device_ptx_text=embed_device_ptx_text, + device_backend=device_backend) # ======================================================================== # Type System @@ -87,6 +89,7 @@ def compile_to_llvm( device_triple: str = "x86_64-pc-linux-gnu", host_triple: str = "x86_64-pc-linux-gnu", embed_device_ptx_text: Optional[str] = None, + device_backend: str = 'cpu', # Legacy compat: force_rangeck=True/False is equivalent to # force_flags={'RANGECK': True/False}. 
force_rangeck: Optional[bool] = None) -> str: @@ -114,7 +117,8 @@ def compile_to_llvm( host_triple=host_triple, is_root_compiland=is_root_compiland, is_device_compiland=is_device_compiland, - embed_device_ptx_text=embed_device_ptx_text) + embed_device_ptx_text=embed_device_ptx_text, + device_backend=device_backend) module = codegen.codegen(ast) return str(module) diff --git a/src/pascal1981/codegen/base.py b/src/pascal1981/codegen/base.py index 01c81ee..4514e34 100644 --- a/src/pascal1981/codegen/base.py +++ b/src/pascal1981/codegen/base.py @@ -100,7 +100,8 @@ def __init__(self, host_triple: str = "x86_64-pc-linux-gnu", is_root_compiland: bool = True, is_device_compiland: bool = False, - embed_device_ptx_text: Optional[str] = None): + embed_device_ptx_text: Optional[str] = None, + device_backend: str = 'cpu'): # Each compilation gets its own LLVM context. Identified struct types # (used for named records, so self-referential linked-list nodes can # build) are interned by name *within a context*; the default global @@ -211,6 +212,13 @@ def __init__(self, # embedding *mechanism* is always present so the GPU swap is a runtime # change, but the CPU-device path never executes the PTX. self._embed_device_ptx_text: Optional[str] = embed_device_ptx_text + # Host launch backend: 'cpu' (CPU-device stand-in) emits the per-kernel + # dispatch thunk + registry that resolves and calls the kernel in-process; + # 'cuda' targets the real CUDA Driver API shim, where the kernel is the + # loaded PTX module and the host never references the kernel symbol -- so + # the thunk/registry (and the dead link-time kernel reference they force, + # i.e. the second 'dev.ll' device compile) are suppressed entirely. 
+ self.device_backend: str = device_backend self._build_extern_factories() # INPUT/OUTPUT: only PROGRAM owns the strong definition; MODULE and # UNIT compilands emit declare-only (external global) so the linker diff --git a/src/pascal1981/codegen/stmts.py b/src/pascal1981/codegen/stmts.py index 988c8f6..0cb3da3 100644 --- a/src/pascal1981/codegen/stmts.py +++ b/src/pascal1981/codegen/stmts.py @@ -502,7 +502,15 @@ def _codegen_device_orchestration(self, name: str, args: list) -> None: # get_function returns the thunk, and launch calls it; on the GPU the # same three calls become cuModuleLoadData(ptx) / cuModuleGetFunction / # cuLaunchKernel, with no change here. - self._record_launched_kernel(fn.name, self._kernel_launch_thunk(fn)) + # On the CPU-device backend the launch is dispatched in-process through a + # per-kernel thunk recorded in this compiland's registry; that thunk + # statically references the kernel symbol, which is what forces the + # separate host-ABI device compile (dev.ll) at link time. On the CUDA + # backend the kernel is the loaded PTX module and the shim dispatches it + # by name, so we emit neither thunk nor registry -- the host .ll then has + # no undefined kernel symbol and needs no dev.ll. + if self.device_backend != 'cuda': + self._record_launched_kernel(fn.name, self._kernel_launch_thunk(fn)) module = self.builder.call( self.runtime_extern('pas_dev_module_load'), [self._launch_registry_ptr(), self._device_ptx_ptr()]) @@ -527,6 +535,12 @@ def _launch_registry_ptr(self) -> ir.Value: """ i8p = ir.IntType(8).as_pointer() i64 = ir.IntType(64) + # CUDA backend: there is no in-process registry (the kernel is the loaded + # PTX module and the shim ignores this argument), so pass a null pointer + # rather than referencing an external registry global that nothing + # defines -- which would otherwise be an undefined symbol at link. 
+ if self.device_backend == 'cuda': + return ir.Constant(i8p, None) if self._launch_registry_gv is None: reg_ty = ir.LiteralStructType([i8p.as_pointer(), i8p.as_pointer(), i64]) self._launch_registry_gv = ir.GlobalVariable( @@ -545,7 +559,20 @@ def _device_ptx_ptr(self) -> ir.Value: """ i8 = ir.IntType(8) i8p = i8.as_pointer() + zero = ir.Constant(ir.IntType(32), 0) if self._device_ptx_gv is None: + if self.device_backend == 'cuda' and not self._embed_device_ptx_text: + # CUDA backend, decoupled packaging: the PTX blob is its own + # object (built from the .ptx at link time), referenced here as + # an external `const char __pas_device_ptx[]`. The host .ll no + # longer needs the kernel text baked in, so host compile does not + # depend on the device artifact. + gv = ir.GlobalVariable(self.module, ir.ArrayType(i8, 0), + name='__pas_device_ptx') + gv.global_constant = True + gv.linkage = 'external' + self._device_ptx_gv = gv + return self.builder.bitcast(gv, i8p) text = self._embed_device_ptx_text or '' data = bytearray(text.encode('utf-8') + b'\0') const = ir.Constant(ir.ArrayType(i8, len(data)), data) @@ -553,7 +580,9 @@ def _device_ptx_ptr(self) -> ir.Value: gv.global_constant = True gv.initializer = const self._device_ptx_gv = gv - zero = ir.Constant(ir.IntType(32), 0) + if isinstance(self._device_ptx_gv.type.pointee, ir.ArrayType) and \ + self._device_ptx_gv.type.pointee.count == 0: + return self.builder.bitcast(self._device_ptx_gv, i8p) return self.builder.gep(self._device_ptx_gv, [zero, zero]) def _emit_launch_registry(self) -> None: diff --git a/src/pascal1981/compile_to_llvm.py b/src/pascal1981/compile_to_llvm.py index 8275211..bd75c2f 100644 --- a/src/pascal1981/compile_to_llvm.py +++ b/src/pascal1981/compile_to_llvm.py @@ -45,6 +45,28 @@ def main() -> int: help='LLVM target triple for DEVICE MODULE units; e.g. nvptx64-nvidia-cuda or ' 'amdgcn-amd-amdhsa. 
Defaults to the host x86 triple (CPU-device: address ' 'spaces collapse to addrspace 0).') + parser.add_argument('--target', + choices=['host', 'ptx'], + default='host', + help='Output target: host LLVM IR (.ll, default) or device NVPTX assembly ' + '(.ptx). --target ptx selects the NVPTX device triple and honors --sm; it ' + 'is the single-CLI replacement for python -m pascal1981.compile_to_ptx.') + parser.add_argument('--sm', + default='sm_70', + metavar='ARCH', + help='NVPTX target CPU for --target ptx, e.g. sm_70, sm_86 (default: sm_70).') + parser.add_argument('--emit-llvm', + default=None, + metavar='PATH', + help='With --target ptx, also write the intermediate NVPTX LLVM IR to PATH.') + parser.add_argument('--device-backend', + choices=['cpu', 'cuda'], + default='cpu', + help='Host launch backend for LAUNCH lowering. cpu (default): emit the ' + 'in-process dispatch thunk + registry (CPU-device stand-in). cuda: target ' + 'the CUDA Driver API shim -- the kernel is the loaded PTX module, so no ' + 'thunk/registry and no dead kernel-symbol reference are emitted (no dev.ll ' + 'needed at link).') parser.add_argument('--embed-device-ptx', default=None, metavar='PTX_FILE', @@ -84,6 +106,42 @@ def main() -> int: print(runtime_lib_path()) return 0 + if args.target == 'ptx': + # Single-CLI device path: parse/check/lower to NVPTX IR, then PTX. 
+ from .compile_to_ptx import compile_file_to_ptx + try: + features = resolve_features(args.dialect, args.feature) + except ValueError as exc: + parser.error(str(exc)) + if not args.source_file: + parser.error('--target ptx requires a source file') + try: + device_triple = args.device_triple + if device_triple == 'x86_64-pc-linux-gnu': + device_triple = 'nvptx64-nvidia-cuda' + ptx = compile_file_to_ptx( + args.source_file, + host_triple=args.host_triple, + device_triple=device_triple, + cpu=args.sm, + features=features, + emit_llvm_path=args.emit_llvm, + ) + if args.output_file: + with open(args.output_file, 'w') as f: + f.write(ptx) + print(f'Wrote {args.output_file}', file=sys.stderr) + else: + print(ptx) + return 0 + except Exception as exc: + print(f'Error: {exc}', file=sys.stderr) + if args.verbose: + traceback.print_exc() + else: + print('(re-run with -v for a full traceback)', file=sys.stderr) + return 1 + if args.list_features: for feature in all_features(): print(f'{feature.name}\tdefault={str(feature.default).lower()}\t{feature.help}') @@ -164,7 +222,8 @@ def main() -> int: features=features, host_triple=args.host_triple, device_triple=args.device_triple, - embed_device_ptx_text=embed_device_ptx_text) + embed_device_ptx_text=embed_device_ptx_text, + device_backend=args.device_backend) # Output if output_file: diff --git a/tests/test_device_ptx_module.py b/tests/test_device_ptx_module.py index 324e09e..bd26c51 100644 --- a/tests/test_device_ptx_module.py +++ b/tests/test_device_ptx_module.py @@ -79,7 +79,7 @@ """ -def _compile_main_ir(proj_files, *, embed_ptx=None): +def _compile_main_ir(proj_files, *, embed_ptx=None, device_backend='cpu'): """Compile main.pas of a project to host IR, optionally embedding PTX.""" with temporary_pascal_project(proj_files) as proj: main_path = os.path.join(proj, 'main.pas') @@ -87,7 +87,8 @@ def _compile_main_ir(proj_files, *, embed_ptx=None): result = PascalTypeChecker(source_file=main_path, features=_WIDE).check(ast) 
assert result.success, result.errors return compile_to_llvm(ast, source_file=main_path, features=_WIDE, - embed_device_ptx_text=embed_ptx) + embed_device_ptx_text=embed_ptx, + device_backend=device_backend) @requires_llvm @@ -125,6 +126,50 @@ def test_empty_blob_emitted_without_ptx(self): self.assertIn('@"__pas_device_ptx" = constant [1 x i8]', ir) +@requires_llvm +class TestCudaBackendDecoupling(unittest.TestCase): + """--device-backend cuda removes the CPU stand-in machinery entirely. + + On the CUDA backend the kernel is the loaded PTX module and the shim + dispatches it by name, so the host must NOT emit the per-kernel dispatch + thunk, the registry, or any reference to the kernel symbol -- those were the + only reason the device unit had to be compiled a second time (dev.ll) and + linked into the host. The PTX blob is referenced as an external symbol + (its own object at link time), so host compile no longer depends on the + device artifact. + """ + + def test_no_thunk_no_registry_no_kernel_ref(self): + ir = _compile_main_ir({'vadd.inc': _IFACE, 'main.pas': _MAIN}, + device_backend='cuda') + # The three-step driver path is still emitted... + self.assertIn('pas_dev_module_load', ir) + self.assertIn('pas_dev_module_get_function', ir) + self.assertIn('pas_dev_launch', ir) + # ...but with none of the CPU stand-in scaffolding. + self.assertNotIn('__pas_klaunch', ir) # no thunk, no registry + self.assertNotIn('define i32 @"add"', ir) # no kernel definition + # The kernel symbol is never *referenced* (an unused extern declare is + # harmless; a call/thunk would force the dead dev.ll link). + self.assertNotIn('call void @"add"', ir) + + def test_ptx_blob_is_external_not_embedded(self): + ir = _compile_main_ir({'vadd.inc': _IFACE, 'main.pas': _MAIN}, + device_backend='cuda') + # Host references the blob as an external symbol; the bytes live in a + # separate object built from the .ptx at link time. 
+ self.assertIn('@"__pas_device_ptx" = external constant', ir) + + def test_explicit_embed_still_wins_on_cuda_backend(self): + # Legacy opt-in: --embed-device-ptx still bakes the bytes in even on the + # cuda backend (two-input link), so the old path keeps working. + ptx = '.visible .entry add() { ret; }\n' + ir = _compile_main_ir({'vadd.inc': _IFACE, 'main.pas': _MAIN}, + embed_ptx=ptx, device_backend='cuda') + self.assertNotIn('external constant', ir.split('__pas_device_ptx')[1][:40]) + self.assertIn('visible .entry add', ir) + + @requires_llvm class TestRegistryDedup(unittest.TestCase): From 00073fa7cf73075503397dcb075dc2a54a1d4a06 Mon Sep 17 00:00:00 2001 From: Dixie Flatline a/k/a McCoy Pauley Date: Sat, 27 Jun 2026 04:04:43 +0000 Subject: [PATCH 03/10] Record as-built status in device-build-cleanup plan --- docs/device-build-cleanup-plan.md | 37 ++++++++++++++++++++++++++++++- 1 file changed, 36 insertions(+), 1 deletion(-) diff --git a/docs/device-build-cleanup-plan.md b/docs/device-build-cleanup-plan.md index 84e1c51..cbeabe1 100644 --- a/docs/device-build-cleanup-plan.md +++ b/docs/device-build-cleanup-plan.md @@ -1,6 +1,41 @@ # Plan: collapse the GPU device-build pipeline to three commands -Status: PROPOSED. No code changed yet — this is the design. +Status: IMPLEMENTED (commit 47ba728), with one deliberately-deferred optional +item. See §7 for the as-built status against this design. + +## 7. As-built status + +Landed as planned: + +- **§3.1 — `dev.ll` killed.** `--device-backend cuda` suppresses the + `__pas_klaunch_*` thunk and registry and passes a null registry pointer; host + `.ll` carries no kernel-symbol reference, so no second device compile is + linked. Verified by grep + `ld -r` resolution. +- **§3.2 — PTX decoupled.** Host references an external `__pas_device_ptx` + symbol; the PTX text is packaged as a NUL-terminated `*_blob.o` via an + `.incbin` assembly stub. `--embed-device-ptx` retained as a legacy opt-in. 
+- **§3.5 — both runtime archives prebuilt** (`libpascalrt_{cpu,cuda}.a`, two + full archives in one `make`; the "simpler" variant). `runtime-cuda` + clean-rebuild phony deleted. +- **§4 / §5 — build files + migration.** `device-example.mk` and + `build-cuda-host.sh` reduced to the three-command flow; `compile_to_ptx`, + `--embed-device-ptx`, and the CPU path all still work; PTX ABI unchanged. +- **§6 — validation (2 of 3 rungs).** New `--target ptx` output is byte-identical + to the pre-change tree (diffed against 571c9bb); a regression test pins + "no thunk / no kernel ref / external PTX symbol" on the cuda backend. + +Deviations / not done: + +- **§3.3 (optional `ptxas`/cubin route) — NOT implemented.** Marked optional; no + `ptxas` driving or cubin embedding was added. Future add-on. +- **§3.4 — built in the reverse direction.** Rather than making `compile_to_ptx` + forward to `--target ptx`, `--target ptx` calls into + `compile_to_ptx.compile_file_to_ptx` and the old CLI is kept intact as the + alias. Functionally identical (single driver, shared flags, byte-identical + PTX); only the dependency direction differs from the text above. +- **§6 on-GPU run — environmentally blocked.** No NVIDIA device/`ptxas` in the + dev VM, so the final "link + run on a GPU box" rung and the `ptxas` text + checks remain unexecuted here. ## 1. 
Where the bodies are buried (current state) From ef1acd3df9e4e86e119de99b6f3715bc500c00bb Mon Sep 17 00:00:00 2001 From: Dixie Flatline a/k/a McCoy Pauley Date: Sat, 27 Jun 2026 04:22:32 +0000 Subject: [PATCH 04/10] Update RUNNING_PTX.md to the unified --target ptx CLI --- examples/device_ptx/fill_indices/RUNNING_PTX.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/examples/device_ptx/fill_indices/RUNNING_PTX.md b/examples/device_ptx/fill_indices/RUNNING_PTX.md index 65fcc41..64833c0 100644 --- a/examples/device_ptx/fill_indices/RUNNING_PTX.md +++ b/examples/device_ptx/fill_indices/RUNNING_PTX.md @@ -33,13 +33,17 @@ not the same as a successful CUDA launch. The machine knows when you lie. From this example directory in the Pascal repository: ```bash -PYTHONPATH=../../../src python3 -m pascal1981.compile_to_ptx \ +PYTHONPATH=../../../src python3 -m pascal1981 --target ptx \ fill.pas \ fill.ptx \ --emit-llvm fill.ll \ - --cpu sm_70 + --sm sm_70 ``` +(`--target ptx` on the single `pascal1981` driver replaces the old +`python -m pascal1981.compile_to_ptx`, still accepted as a deprecated alias; +`--sm` replaces `--cpu`.) + Inspect: ```bash From 7713a86313c6c928f7e99fbfe66b5e1ba43d4c1b Mon Sep 17 00:00:00 2001 From: Dixie Flatline a/k/a McCoy Pauley Date: Sat, 27 Jun 2026 04:47:51 +0000 Subject: [PATCH 05/10] CPU device emulates full GPU launch geometry via TLS index registers THREADIDX_*/BLOCKIDX_*/BLOCKDIM_*/GRIDDIM_* on the CPU triple now lower to loads from _Thread_local globals (__pas_tid_x, __pas_ctaid_x, etc.) instead of baked-in constants. pas_dev_launch loops over the full gx*gy*gz x bx*by*bz geometry, setting those registers before each thunk call -- the same semantic a GPU provides via hardware special registers. BLOCKDIM_*/GRIDDIM_* default to 1 so direct (non-LAUNCH) calls retain the old single-thread behaviour. 
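The launch emulation described above can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the shim's C code: `regs`, `fill_kernel`, and `launch` are hypothetical stand-ins for the `__pas_*` thread-local globals, a one-thread-per-element kernel, and a 1-D `pas_dev_launch`.

```python
# Illustrative 1-D model of the CPU shim's launch loop (all names hypothetical).
# `regs` stands in for the thread-local __pas_* index globals; the dimension
# entries default to 1 and the index entries to 0, as in the commit message.
regs = {"tid_x": 0, "ctaid_x": 0, "ntid_x": 1, "nctaid_x": 1}


def fill_kernel(out):
    """One-thread-per-element kernel: store the global index at its own slot."""
    i = regs["ctaid_x"] * regs["ntid_x"] + regs["tid_x"]
    if i < len(out):
        out[i] = i


def launch(kernel, gx, bx, *args):
    """Set the dimensions once, then invoke the kernel for every thread."""
    regs["ntid_x"], regs["nctaid_x"] = bx, gx
    for block in range(gx):
        regs["ctaid_x"] = block
        for thread in range(bx):
            regs["tid_x"] = thread
            kernel(*args)


# Direct (non-LAUNCH) call with default registers: only element 0 is touched.
direct = [-1] * 4
fill_kernel(direct)

# Emulated launch of 4 blocks x 64 threads: every one of the 256 slots is visited.
out = [-1] * 256
launch(fill_kernel, 4, 64, out)
```

In this model the kernel body is identical for the direct call and the launch; only the register values differ, which is the property the change relies on to leave the kernels untouched.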
- codegen/exprs.py: CPU-triple builtins emit TLS loads, not constants - runtime/cpu_device_shim.c: define 12 TLS vars; loop in pas_dev_launch - examples/device_ptx/device-example.mk: wire DEVICE=cpu build+link - tests: update index-intrinsic test; add shim to mandelbrot_x86 link - CPU_DEVICE_TODO.md: marked done Verified: fill_indices OK all 256, mandelbrot full image -- no kernel changes, PTX output unchanged. --- examples/device_ptx/CPU_DEVICE_TODO.md | 25 ++++++++- examples/device_ptx/device-example.mk | 26 +++++---- examples/device_ptx/fill_indices/fill_host | Bin 0 -> 18240 bytes .../device_ptx/mandelbrot/mandelbrot_host | Bin 0 -> 22520 bytes runtime/cpu_device_shim.c | 51 +++++++++++++++--- src/pascal1981/codegen/exprs.py | 42 ++++++++++++--- .../integration/test_device_mandelbrot_x86.py | 10 ++-- tests/test_device_index_intrinsics.py | 10 ++-- 8 files changed, 132 insertions(+), 32 deletions(-) create mode 100755 examples/device_ptx/fill_indices/fill_host create mode 100755 examples/device_ptx/mandelbrot/mandelbrot_host diff --git a/examples/device_ptx/CPU_DEVICE_TODO.md b/examples/device_ptx/CPU_DEVICE_TODO.md index 08e2eb6..6fa198f 100644 --- a/examples/device_ptx/CPU_DEVICE_TODO.md +++ b/examples/device_ptx/CPU_DEVICE_TODO.md @@ -1,4 +1,4 @@ -# CPU device support for the `device_ptx` examples — future work +# CPU device support for the `device_ptx` examples — DONE The Makefiles in `fill_indices/` and `mandelbrot/` accept `DEVICE=cpu` and `DEVICE=cuda`. Only `DEVICE=cuda` is wired today; `DEVICE=cpu` prints a pointer @@ -31,7 +31,28 @@ Both example kernels are one-thread-per-element: This is a property of the kernels, not the orchestration or the shim. -## What enabling CPU needs: grid-stride kernels +## How it was fixed (implemented) + +Rather than changing the kernels, the CPU shim was made to actually emulate GPU +execution: + +1. 
**Compiler (`codegen/exprs.py`)**: on the CPU triple, `THREADIDX_*`, + `BLOCKIDX_*`, `BLOCKDIM_*`, `GRIDDIM_*` now lower to **loads from + thread-local globals** (`__pas_tid_x`, `__pas_ctaid_x`, etc.) instead of + baked-in constants. The runtime defines these. + +2. **CPU shim (`runtime/cpu_device_shim.c`)**: `pas_dev_launch` now loops over + the full launch geometry (`gx*gy*gz` blocks × `bx*by*bz` threads), setting + the TLS index registers before each thunk call. `BLOCKDIM_*`/`GRIDDIM_*` + default to 1 so direct (non-LAUNCH) kernel calls still work. + +3. **Makefile (`device-example.mk`)**: the `DEVICE=cpu` stub now builds and + links `dev.ll` + `host.ll` against `libpascalrt_cpu.a`. + +The kernels are unchanged. `make DEVICE=cpu run` now produces correct output for +both `fill_indices` (all 256 indices correct) and `mandelbrot` (full image). + +## What was previously needed (now moot): grid-stride kernels Make each kernel iterate its whole index space with a grid-stride loop instead of handling a single element. For a 1-D kernel: diff --git a/examples/device_ptx/device-example.mk b/examples/device_ptx/device-example.mk index 2e9901d..6d071b6 100644 --- a/examples/device_ptx/device-example.mk +++ b/examples/device_ptx/device-example.mk @@ -70,15 +70,23 @@ $(EXE): $(BUILD)/host.ll $(BUILD)/dev_ptx_blob.o -L$(CUDA_HOME)/lib64/stubs -lcuda -o $@ else ifeq ($(DEVICE),cpu) -# ---- CPU device: FUTURE WORK (see CPU_DEVICE_TODO.md) ----------------------- -# The host orchestration already works on the CPU shim; what's missing is kernel -# coverage. The CPU device runs a single-thread grid, so a one-thread-per-element -# kernel computes only element 0. Enabling this is a kernel change, deferred. -$(EXE): - @echo "DEVICE=cpu is not yet wired for this example." >&2 - @echo "See examples/device_ptx/CPU_DEVICE_TODO.md for why and what it" >&2 - @echo "needs. 
For now, build and run on a GPU with: make DEVICE=cuda" >&2 - @false +# ---- CPU device: full-grid emulation via thread-local index registers ------- +# The CPU shim now emulates a GPU launch: pas_dev_launch loops over the full +# gx*gy*gz x bx*by*bz grid, setting thread-local __pas_tid_*/ __pas_ctaid_* +# globals before each thunk call so the kernel sees the correct indices. +# The device unit compiles to the host triple (no PTX), and links alongside +# the host .ll against libpascalrt_cpu.a. No GPU or CUDA toolkit required. +# +# Build the cpu runtime archive once with: make -C runtime +# (this Makefile does not rebuild it on every example build). +$(BUILD)/dev.ll: $(DEVICE_UNIT) | $(BUILD) + $(PAS) $(FEATURES) $< $@ + +$(BUILD)/host.ll: $(HOST_SRC) | $(BUILD) + $(PAS) $(FEATURES) $< $@ + +$(EXE): $(BUILD)/host.ll $(BUILD)/dev.ll + clang $(BUILD)/host.ll $(BUILD)/dev.ll $(RUNTIME_LIB) -lm -o $@ else $(error DEVICE must be 'cpu' or 'cuda', got '$(DEVICE)')
diff --git a/examples/device_ptx/fill_indices/fill_host b/examples/device_ptx/fill_indices/fill_host new file mode 100755 index 0000000000000000000000000000000000000000..e56b03f3085a78c87f086ee0bf74c8aafcf89ea9 GIT binary patch literal 18240 [binary payload omitted]
diff --git a/examples/device_ptx/mandelbrot/mandelbrot_host b/examples/device_ptx/mandelbrot/mandelbrot_host new file mode 100755 GIT binary patch literal 22520 [binary payload omitted]
diff --git a/runtime/cpu_device_shim.c b/runtime/cpu_device_shim.c [...] pas_dev_free(dev) */ +#include #include #include
as loads from these symbols (declared 'external thread_local global i32' + * in the device LLVM IR). pas_dev_launch sets them before each thunk call so + * the kernel body sees the correct indices -- the same values a GPU provides + * via hardware special registers. _Thread_local storage makes the design + * naturally OpenMP-parallelisable: each OS thread gets its own set. */ +/* Thread indices and block indices start at 0 (first and only thread/block). */ +_Thread_local int32_t __pas_tid_x = 0, __pas_tid_y = 0, __pas_tid_z = 0; +_Thread_local int32_t __pas_ctaid_x = 0, __pas_ctaid_y = 0, __pas_ctaid_z = 0; +/* Dimension counts default to 1: a unit grid so stride = BLOCKDIM*GRIDDIM = 1. + * pas_dev_launch overrides these before the first thunk call. */ +_Thread_local int32_t __pas_ntid_x = 1, __pas_ntid_y = 1, __pas_ntid_z = 1; +_Thread_local int32_t __pas_nctaid_x= 1, __pas_nctaid_y= 1, __pas_nctaid_z= 1; + /* Allocate n bytes of "device" memory; returns an opaque handle the host must * not dereference (the dereferenceability invariant). On the CPU device the * handle happens to be a real heap pointer, but Pascal code only ever hands it @@ -94,17 +109,37 @@ void *pas_dev_module_get_function(void *module, const char *name) { return 0; } -/* Launch a resolved entry. CPU device: the entry is the dispatch thunk; call - * it once with the marshalled argument array. Geometry is unused on the CPU - * device (BLOCKDIM_X/GRIDDIM_X lower to 1, so a single-thread grid is correct); - * it carries the same six values cuLaunchKernel consumes. */ +/* Launch a resolved entry. CPU device: the entry is the dispatch thunk. + * We emulate the GPU by iterating over every block (gx*gy*gz) and every thread + * within each block (bx*by*bz), setting the thread-local index registers before + * each call so the kernel body sees the correct THREADIDX_x/BLOCKIDX_x values. + * BLOCKDIM_x/GRIDDIM_x are constant for the whole launch and are set once. 
+ * + * Loop order matches CUDA's row-major convention: x is the fastest-varying + * thread index, z the slowest, mirroring the hardware warp layout. */ typedef void (*pas_klaunch_fn)(void **); void pas_dev_launch(void *entry, long long gx, long long gy, long long gz, long long bx, long long by, long long bz, void **argv) { - (void)gx; (void)gy; (void)gz; - (void)bx; (void)by; (void)bz; - if (entry) - ((pas_klaunch_fn)entry)(argv); + if (!entry) return; + pas_klaunch_fn fn = (pas_klaunch_fn)entry; + /* Block and grid dimensions are constant across the launch. */ + __pas_ntid_x = (int32_t)bx; __pas_ntid_y = (int32_t)by; __pas_ntid_z = (int32_t)bz; + __pas_nctaid_x= (int32_t)gx; __pas_nctaid_y= (int32_t)gy; __pas_nctaid_z= (int32_t)gz; + for (long long gz_i = 0; gz_i < gz; gz_i++) + for (long long gy_i = 0; gy_i < gy; gy_i++) + for (long long gx_i = 0; gx_i < gx; gx_i++) { + __pas_ctaid_x = (int32_t)gx_i; + __pas_ctaid_y = (int32_t)gy_i; + __pas_ctaid_z = (int32_t)gz_i; + for (long long bz_i = 0; bz_i < bz; bz_i++) + for (long long by_i = 0; by_i < by; by_i++) + for (long long bx_i = 0; bx_i < bx; bx_i++) { + __pas_tid_x = (int32_t)bx_i; + __pas_tid_y = (int32_t)by_i; + __pas_tid_z = (int32_t)bz_i; + fn(argv); + } + } } diff --git a/src/pascal1981/codegen/exprs.py b/src/pascal1981/codegen/exprs.py index fed5144..f93dec7 100644 --- a/src/pascal1981/codegen/exprs.py +++ b/src/pascal1981/codegen/exprs.py @@ -727,18 +727,48 @@ def _to_i16(v: ir.Value) -> ir.Value: # Built-in Functions # ======================================================================== + # Mapping from Pascal builtin name to the thread-local global the CPU shim + # defines and pas_dev_launch sets before each kernel invocation. 
+ _CPU_TLS_GLOBALS = { + 'THREADIDX_X': '__pas_tid_x', + 'THREADIDX_Y': '__pas_tid_y', + 'THREADIDX_Z': '__pas_tid_z', + 'BLOCKIDX_X': '__pas_ctaid_x', + 'BLOCKIDX_Y': '__pas_ctaid_y', + 'BLOCKIDX_Z': '__pas_ctaid_z', + 'BLOCKDIM_X': '__pas_ntid_x', + 'BLOCKDIM_Y': '__pas_ntid_y', + 'BLOCKDIM_Z': '__pas_ntid_z', + 'GRIDDIM_X': '__pas_nctaid_x', + 'GRIDDIM_Y': '__pas_nctaid_y', + 'GRIDDIM_Z': '__pas_nctaid_z', + } + def codegen_device_index_builtin(self, name: str) -> ir.Value: """Lower DEVICE thread/block index reads. - On the CPU-device stand-in, DEVICE code executes as a one-thread, - one-block grid. On NVPTX, lower to the corresponding special-register - read intrinsic. AMDGPU dimension plumbing is deferred; keep it - deterministic rather than inventing a half-wrong dispatch-ptr decode. + On the CPU-device stand-in, lower each builtin to a load from a + thread-local global variable defined and maintained by the CPU shim's + ``pas_dev_launch`` loop. This lets the shim drive every thread in the + launch geometry and have the kernel see the correct index on each + invocation -- the same semantic a GPU provides via hardware registers. + + On NVPTX, lower to the corresponding special-register read intrinsic. + AMDGPU dimension plumbing is deferred; keep it deterministic rather + than inventing a half-wrong dispatch-ptr decode. """ upper = name.upper() if not _is_gpu_triple(self.device_triple): - value = 1 if upper.startswith(('BLOCKDIM_', 'GRIDDIM_')) else 0 - return ir.Constant(ir.IntType(32), value) + # Emit a load from the thread-local global the CPU shim sets. 
+ tls_name = self._CPU_TLS_GLOBALS[upper] + i32 = ir.IntType(32) + try: + gv = self.module.get_global(tls_name) + except KeyError: + gv = ir.GlobalVariable(self.module, i32, tls_name) + gv.storage_class = 'thread_local' + # linkage stays 'external' (default) — defined in cpu_device_shim.c + return self.builder.load(gv) if self.device_triple.startswith('nvptx'): nvptx_map = { 'THREADIDX_X': 'llvm.nvvm.read.ptx.sreg.tid.x', diff --git a/tests/integration/test_device_mandelbrot_x86.py b/tests/integration/test_device_mandelbrot_x86.py index 87880c9..81010c9 100644 --- a/tests/integration/test_device_mandelbrot_x86.py +++ b/tests/integration/test_device_mandelbrot_x86.py @@ -146,12 +146,14 @@ def test_mandelbrot_runs_on_x86_cpu_device(self): with open(harness_path, 'w') as f: f.write(_HARNESS_C) - # 3. Link and run. No Pascal runtime needed: the kernel is - # self-contained (no host I/O, no externs on the CPU-device - # path). + # 3. Link and run. The kernel IR now references the thread-local + # index globals (__pas_tid_x etc.) defined in cpu_device_shim.c, + # so link that in too (no other Pascal runtime needed). 
+ shim_path = os.path.join( + os.path.dirname(__file__), '..', '..', 'runtime', 'cpu_device_shim.c') exe_path = os.path.join(tmpdir, 'mandelbrot_x86') link = subprocess.run( - ['clang', ir_path, harness_path, '-o', exe_path], + ['clang', ir_path, harness_path, shim_path, '-o', exe_path], capture_output=True, text=True) self.assertEqual(link.returncode, 0, msg=link.stderr) diff --git a/tests/test_device_index_intrinsics.py b/tests/test_device_index_intrinsics.py index 3a8fdb0..21270ba 100644 --- a/tests/test_device_index_intrinsics.py +++ b/tests/test_device_index_intrinsics.py @@ -53,11 +53,15 @@ def _compile(self, src, device_triple='x86_64-pc-linux-gnu'): ast = parse_source(src) return compile_to_llvm(ast, device_triple=device_triple) - def test_cpu_device_lowers_reads_to_one_thread_grid_constants(self): + def test_cpu_device_lowers_reads_to_tls_globals(self): ir = self._compile(DEVICE_SRC) self.assertNotIn('llvm.nvvm.read.ptx.sreg', ir) - self.assertIn('mul i32 0, 1', ir) - self.assertIn('add i32 %".3", 1', ir) + # Each builtin lowers to a load from a thread-local global so that + # pas_dev_launch can set the correct index before each thunk call. 
+ self.assertIn('thread_local global i32', ir) + self.assertIn('@"__pas_tid_x"', ir) + self.assertIn('@"__pas_ntid_x"', ir) + self.assertIn('@"__pas_ctaid_x"', ir) def test_nvptx_lowers_all_reads_to_special_register_intrinsics(self): ir = self._compile(ALL_INDEX_READS_SRC, device_triple='nvptx64-nvidia-cuda') From ab275ee0b28e8a3f3d8be82130b7753367d1345e Mon Sep 17 00:00:00 2001 From: Dixie Flatline a/k/a McCoy Pauley Date: Sat, 27 Jun 2026 04:48:29 +0000 Subject: [PATCH 06/10] Move completed plan docs to docs/old/ - device-build-cleanup-plan.md: fully implemented (commit 47ba728) - CPU_DEVICE_TODO.md: CPU device now emulates full GPU launch geometry (commit 7713a86) --- {examples/device_ptx => docs/old}/CPU_DEVICE_TODO.md | 0 docs/{ => old}/device-build-cleanup-plan.md | 0 2 files changed, 0 insertions(+), 0 deletions(-) rename {examples/device_ptx => docs/old}/CPU_DEVICE_TODO.md (100%) rename docs/{ => old}/device-build-cleanup-plan.md (100%) diff --git a/examples/device_ptx/CPU_DEVICE_TODO.md b/docs/old/CPU_DEVICE_TODO.md similarity index 100% rename from examples/device_ptx/CPU_DEVICE_TODO.md rename to docs/old/CPU_DEVICE_TODO.md diff --git a/docs/device-build-cleanup-plan.md b/docs/old/device-build-cleanup-plan.md similarity index 100% rename from docs/device-build-cleanup-plan.md rename to docs/old/device-build-cleanup-plan.md From 84fa8c1f68aa90c449508207229b42b6a9b03345 Mon Sep 17 00:00:00 2001 From: Dixie Flatline a/k/a McCoy Pauley Date: Sat, 27 Jun 2026 04:52:33 +0000 Subject: [PATCH 07/10] Document CPU device emulation in example READMEs and code docs The two device_ptx example READMEs and the CUDA prescription doc still described DEVICE=cpu as 'not yet wired' / a single-thread grid. 
The CPU shim now emulates the full launch geometry, so update both READMEs with CPU + CUDA build/run instructions and prerequisites, fix the device-example.mk header comment, and update the now-stale 'single-thread grid' language in cuda-kernel-prescription.md and the codegen docstrings to match the TLS-index-register emulation. --- docs/cuda-kernel-prescription.md | 24 ++++++++++++++-------- examples/device_ptx/device-example.mk | 6 +++++- examples/device_ptx/fill_indices/README.md | 22 +++++++++++++------- examples/device_ptx/mandelbrot/README.md | 17 +++++++++++---- runtime/cpu_device_shim.c | 4 ++-- src/pascal1981/codegen/stmts.py | 14 ++++++++----- 6 files changed, 60 insertions(+), 27 deletions(-) diff --git a/docs/cuda-kernel-prescription.md b/docs/cuda-kernel-prescription.md index 7776b45..9a636e4 100644 --- a/docs/cuda-kernel-prescription.md +++ b/docs/cuda-kernel-prescription.md @@ -529,8 +529,13 @@ coerced to the kernel's parameter ABI — exactly what `cuLaunchKernel` consumes `pas_dev_launch(name, thunk, gx,gy,gz, bx,by,bz, argv)`. Geometry is supplied as 2 values (grid, block → a 1-D launch) or 6 (gx,gy,gz, bx,by,bz); the count is implied by the kernel's arity. On the CPU device `pas_dev_launch` invokes a compiler-emitted per-kernel dispatch -thunk `__pas_klaunch_(void** argv)` that unpacks `argv` and calls the kernel as a -single-thread grid, so its grid-stride loop covers the whole buffer. The kernel-name string +thunk `__pas_klaunch_(void** argv)` that unpacks `argv` and calls the kernel. +The CPU shim emulates the full launch: `pas_dev_launch` loops over the entire +`gx*gy*gz x bx*by*bz` grid, setting thread-local index registers (`__pas_tid_x` +etc., which `THREADIDX_*`/`BLOCKIDX_*` lower to on the CPU triple) before each +call, so the kernel sees the correct indices — the same semantic a GPU provides. +A grid-stride kernel covers the whole buffer on a single-thread grid, but the +emulation now covers one-thread-per-element kernels too. 
The kernel-name string and the geometry ride along unused on the CPU device — they are precisely what the CUDA shim will consume. So running the *same* Pascal program on a GPU is now a pure runtime-library swap: replace the four `cpu_device_shim.c` functions with CUDA Driver API wrappers and let @@ -589,13 +594,16 @@ currently gets away without it. (A3) for the no-host-symbols invariant. - §3 (entry points): on `device=x86` the kernel calling convention is inert/ignored - kernel *logic* still runs serially, so you can test kernel *correctness* on CPU before you have a GPU. -- §4 (intrinsics): provide CPU-device lowerings - `THREADIDX_X`→0, `BLOCKDIM_X`→1, - `SYNCTHREADS`→no-op - so a kernel run on the CPU executes as a single-thread grid and - produces the right scalar answer. This lets you validate kernel math with zero GPU. +- §4 (intrinsics): provide CPU-device lowerings - `THREADIDX_X`/`BLOCKIDX_X` lower to + loads from thread-local globals (`__pas_tid_x` etc.) that `pas_dev_launch` sets before each + kernel call; `BLOCKDIM_X`/`GRIDDIM_X` likewise. So a kernel run on the CPU executes across + the *full* launch geometry and produces the right answer. This lets you validate kernel + math with zero GPU. - §5 (orchestration): a CPU-device shim where `DEVALLOC`=`malloc`, copies=`memcpy`, and - `LAUNCH` marshals a `void**` and calls `pas_dev_launch`, which runs a per-kernel dispatch - thunk (single-thread grid). Same Pascal program, no GPU. Then swap the shim for the CUDA one - — the launch call site is already GPU-shaped, so only the runtime library changes. + `LAUNCH` marshals a `void**` and calls `pas_dev_launch`, which loops over the full grid, + setting thread-local index registers before each per-kernel dispatch-thunk call. Same + Pascal program, no GPU. Then swap the shim for the CUDA one — the launch call site is + already GPU-shaped, so only the runtime library changes. This is the CPU-device dividend the design designed for; lean on it. 
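As a rough mental model of the CPU-device launch described above (a sketch only, not the shim's C code), the loop nest can be mimicked in Python. The names `emulate_launch` and `fill_indices`, the dictionary standing in for the thread-local index registers, and the 4-block by 64-thread geometry are all assumptions made for illustration; only the loop order (x fastest, z slowest) and the index arithmetic follow the text.

```python
from itertools import product


def emulate_launch(kernel, grid, block, *args):
    """Call `kernel` once per (block, thread) pair, x varying fastest."""
    gx, gy, gz = grid
    bx, by, bz = block
    # product() varies its last range fastest, so listing z, y, x mirrors
    # the shim's nesting: z outermost (slowest), x innermost (fastest).
    for cz, cy, cx in product(range(gz), range(gy), range(gx)):
        for tz, ty, tx in product(range(bz), range(by), range(bx)):
            # Hypothetical stand-in for the thread-local index registers.
            idx = {
                "tid": (tx, ty, tz),
                "ctaid": (cx, cy, cz),
                "ntid": block,
                "nctaid": grid,
            }
            kernel(idx, *args)


def fill_indices(idx, buf):
    # One thread per element: i = BLOCKIDX_X * BLOCKDIM_X + THREADIDX_X
    i = idx["ctaid"][0] * idx["ntid"][0] + idx["tid"][0]
    if i < len(buf):
        buf[i] = i


# Full geometry: every element is written by its own emulated thread.
full = [-1] * 256
emulate_launch(fill_indices, (4, 1, 1), (64, 1, 1), full)

# Single-thread grid: a one-thread-per-element kernel only reaches element 0.
single = [-1] * 256
emulate_launch(fill_indices, (1, 1, 1), (1, 1, 1), single)
```

The contrast between `full` and `single` is the point: a grid-stride kernel would cover the buffer either way, while a one-thread-per-element kernel needs the whole geometry to be driven.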
diff --git a/examples/device_ptx/device-example.mk b/examples/device_ptx/device-example.mk index 6d071b6..025224e 100644 --- a/examples/device_ptx/device-example.mk +++ b/examples/device_ptx/device-example.mk @@ -6,13 +6,17 @@ # DEVICE variable. The host Pascal is identical for both devices; only the build # differs -- which is the whole point of the shim design. # -# make # DEVICE=cpu (CPU stand-in -- see CPU_DEVICE_TODO.md) +# make # DEVICE=cpu (CPU stand-in -- emulates the full grid) # make DEVICE=cuda # real GPU via the CUDA Driver API shim + embedded PTX # make run [DEVICE=...] # build, then run # make clean # # The including Makefile sets: DEVICE_UNIT, HOST_SRC, EXE, FEATURES # (and may override SM or CUDA_HOME). +# +# On the CPU device the shim emulates a full GPU launch: pas_dev_launch loops +# over the whole gx*gy*gz x bx*by*bz geometry, setting thread-local index +# registers (__pas_tid_x etc.) before each kernel call. See runtime/cpu_device_shim.c. DEVICE ?= cpu diff --git a/examples/device_ptx/fill_indices/README.md b/examples/device_ptx/fill_indices/README.md index e4dca92..236d9d9 100644 --- a/examples/device_ptx/fill_indices/README.md +++ b/examples/device_ptx/fill_indices/README.md @@ -30,19 +30,27 @@ has an NVPTX backend. ```bash cd examples/device_ptx/fill_indices -make DEVICE=cuda run # build the host + device, run on the GPU +make DEVICE=cpu run # no GPU needed +make DEVICE=cuda run # real GPU ``` `DEVICE` selects the device-orchestration runtime shim at build time: - `DEVICE=cuda` — the real GPU path (CUDA Driver API shim + embedded PTX). Needs the CUDA toolkit headers, `-lcuda`, and an NVIDIA device. -- `DEVICE=cpu` (the default) — the CPU-device stand-in, **not yet wired for this - example**; see [`../CPU_DEVICE_TODO.md`](../CPU_DEVICE_TODO.md). The host - orchestration already works on the CPU shim; what it needs is a grid-stride - kernel, which is a deferred kernel change. 
- -A correct GPU run prints the first eight buffer elements (`0 1 2 3 4 5 6 7`) and +- `DEVICE=cpu` — the CPU-device stand-in. No GPU or CUDA toolkit required. The + CPU shim emulates a full GPU launch: `pas_dev_launch` loops over the complete + launch geometry (`gx×gy×gz` blocks × `bx×by×bz` threads), setting thread-local + index registers (`__pas_tid_x` etc.) before each kernel call so the kernel sees + the correct `THREADIDX_*`/`BLOCKIDX_*` values. Produces correct output + identical to the CUDA path. + +Prerequisites by device: +- **cpu**: Python + llvmlite, clang, `make -C runtime` (cpu archive, built by default). +- **cuda**: all of the above plus CUDA toolkit headers, `-lcuda`, and an NVIDIA device; + `make -C runtime cuda` for the cuda archive. + +A correct run prints the first eight buffer elements (`0 1 2 3 4 5 6 7`) and `OK: all 256 indices correct`. The build rules live in [`../device-example.mk`](../device-example.mk). diff --git a/examples/device_ptx/mandelbrot/README.md b/examples/device_ptx/mandelbrot/README.md index e7b3f54..ec15d88 100644 --- a/examples/device_ptx/mandelbrot/README.md +++ b/examples/device_ptx/mandelbrot/README.md @@ -32,7 +32,8 @@ the companion *mandelbrot-gpu* repository. ```bash cd examples/device_ptx/mandelbrot -make DEVICE=cuda run # build the host + device, run on the GPU +make DEVICE=cpu run # no GPU needed +make DEVICE=cuda run # real GPU ``` `DEVICE` selects the device-orchestration runtime shim at build time: @@ -40,9 +41,17 @@ make DEVICE=cuda run # build the host + device, run on the GPU - `DEVICE=cuda` — the real GPU path (CUDA Driver API shim + embedded PTX). Needs the CUDA toolkit headers, `-lcuda`, and an NVIDIA device. `SM` defaults to `sm_86` to mirror `mandelbrot.cu`. -- `DEVICE=cpu` (the default) — the CPU-device stand-in, **not yet wired for this - example**; see [`../CPU_DEVICE_TODO.md`](../CPU_DEVICE_TODO.md) (it needs a - grid-stride kernel, a deferred kernel change). 
+- `DEVICE=cpu` — the CPU-device stand-in. No GPU or CUDA toolkit required. The + CPU shim emulates a full GPU launch: `pas_dev_launch` loops over the complete + launch geometry (`gx×gy×gz` blocks × `bx×by×bz` threads), setting thread-local + index registers (`__pas_tid_x` etc.) before each kernel call so the kernel sees + the correct `THREADIDX_*`/`BLOCKIDX_*` values. Produces correct output + identical to the CUDA path. + +Prerequisites by device: +- **cpu**: Python + llvmlite, clang, `make -C runtime` (cpu archive, built by default). +- **cuda**: all of the above plus CUDA toolkit headers, `-lcuda`, and an NVIDIA device; + `make -C runtime cuda` for the cuda archive. The host orchestration is compiler-generated from the Pascal source; only the leaf runtime shim is C. The kernels are unchanged, so the emitted PTX remains the diff --git a/runtime/cpu_device_shim.c b/runtime/cpu_device_shim.c index 053bb6d..f2b888f 100644 --- a/runtime/cpu_device_shim.c +++ b/runtime/cpu_device_shim.c @@ -71,8 +71,8 @@ void pas_dev_free(void *dev_ptr) { * pas_dev_launch(entry, gx,gy,gz, bx,by,bz, argv); // cuLaunchKernel * * On the CPU device the "module" is the registry, get_function is a by-name - * lookup returning the thunk, and launch invokes the thunk as a single-thread - * grid (so a grid-stride kernel still covers the whole buffer). Swapping this + * lookup returning the thunk, and launch drives it across the full launch + * geometry (see pas_dev_launch). Swapping this * file for CUDA Driver API wrappers turns the *same* compiler output into a real * GPU launch with no Pascal-side change: load takes the embedded PTX blob, * get_function returns a CUfunction, launch is cuLaunchKernel. 
(A CUDA shim diff --git a/src/pascal1981/codegen/stmts.py b/src/pascal1981/codegen/stmts.py index 0cb3da3..fbb05f6 100644 --- a/src/pascal1981/codegen/stmts.py +++ b/src/pascal1981/codegen/stmts.py @@ -378,9 +378,11 @@ def _kernel_launch_thunk(self, fn: ir.Function) -> ir.Function: The thunk ``void __pas_klaunch_(i8** argv)`` unpacks ``argv`` into ``fn``'s parameter types and calls ``fn``. This is the CPU-device launch - dispatch: ``pas_dev_launch`` invokes it as a single-thread grid, so a - grid-stride kernel still covers the whole buffer. On a GPU the shim - dispatches the kernel by name out of the loaded module and the thunk is + dispatch: the CPU shim's ``pas_dev_launch`` loops over the full launch + geometry, setting thread-local index registers (``__pas_tid_x`` etc.) + before each thunk call, so the kernel sees the correct indices. On a + GPU the shim dispatches the kernel by name out of the loaded module and + the thunk is never called -- but it is harmless to emit, and LAUNCH only ever appears in host code (never a device compiland), so the thunk never collides with a ``ptx_kernel`` calling convention. @@ -420,8 +422,10 @@ def _codegen_device_orchestration(self, name: str, args: list) -> None: launch ABI: it marshals the kernel arguments into a ``void**`` array (the shape ``cuLaunchKernel`` consumes) and calls ``pas_dev_launch`` with the kernel-name string, a per-kernel dispatch thunk, the six geometry values, - and that array. On the CPU device ``pas_dev_launch`` runs the thunk - (single-thread grid); swapping the shim for the CUDA driver path reuses + and that array. On the CPU device ``pas_dev_launch`` loops over the + full launch geometry, setting thread-local index registers before each + thunk call (so the kernel sees the correct indices); swapping the shim + for the CUDA driver path reuses this exact call site -- it dispatches by name and ignores the thunk -- so no codegen change is needed to run the same program on a GPU (§5.2/§5.4). 
From 1d89492bdb63546eade14c7460b600dcd8f42e65 Mon Sep 17 00:00:00 2001 From: Dixie Flatline a/k/a McCoy Pauley Date: Sat, 27 Jun 2026 05:15:04 +0000 Subject: [PATCH 08/10] Stop GPU test from destroying the shared runtime archive; tighten @requires_gpu The GPU orchestration test's _build_cuda_runtime ran 'make -C runtime clean' against the shared source-tree runtime/build/ (which every other link test links as libpascalrt.a), then rebuilt only the cuda archive. If the cuda build failed -- e.g. a driver-only box (nvidia-smi + libcuda.so.1 but no CUDA toolkit headers) -- setUpClass raised, tearDownClass never ran, and runtime/build/ was left empty, cascading link failures into every other exe-requiring test (the trailing gcc-install-dir-libstdcxx warning in each truncated summary hid the real 'no such file: libpascalrt.a' cause). Two fixes: 1. Build the CUDA shim into an ISOLATED temp copy of the runtime sources so the shared runtime/build/ is never touched. A build failure raises unittest.SkipTest (clean skip) and leaks at most a /tmp dir, never a broken source tree. tearDownClass just removes the temp dir. 2. Tighten _probe_gpu to also require cuda.h (probed the way the Makefile looks for it, -I$CUDA_HOME/include), so @requires_gpu is False on a driver-only box and the test skips at collection rather than being selected and then failing the shim build. Verified: header probe returns False for an empty CUDA_HOME, True when cuda.h is planted; full suite 848 passed, 1 skipped. 
--- .../test_device_orchestration_gpu.py | 52 +++++++++++++++---- tests/support.py | 50 ++++++++++++++++-- 2 files changed, 86 insertions(+), 16 deletions(-) diff --git a/tests/integration/test_device_orchestration_gpu.py b/tests/integration/test_device_orchestration_gpu.py index ccbab7f..c775553 100644 --- a/tests/integration/test_device_orchestration_gpu.py +++ b/tests/integration/test_device_orchestration_gpu.py @@ -91,15 +91,32 @@ """ -def _build_cuda_runtime() -> str: - """Build (once) the runtime archive with the CUDA shim; return its path.""" - out = os.path.join(RUNTIME_DIR, "build", "libpascalrt.a") - subprocess.run(["make", "-C", RUNTIME_DIR, "clean"], - capture_output=True, check=True) - r = subprocess.run(["make", "-C", RUNTIME_DIR, "DEVICE_SHIM=cuda"], +def _build_cuda_runtime(tmpdir: str) -> str: + """Build the CUDA-shim runtime archive into an ISOLATED temp dir. + + Building in a *copy* of the runtime sources keeps the shared source-tree + ``runtime/build/`` (which every other link test links against as + ``libpascalrt.a``) completely untouched -- so this test can neither delete + nor repoint it, and a build failure leaves only a leaked /tmp dir, never a + broken source tree. Raises ``unittest.SkipTest`` (so a GPU box with a + broken/incomplete CUDA toolkit skips cleanly instead of erroring and + cascading failures into every other link test) if the build fails. + """ + # Copy the runtime sources (skip any pre-existing build/ dir) so the + # Makefile can build self-contained inside the temp dir. 
+ for name in os.listdir(RUNTIME_DIR): + if name == 'build': + continue + src = os.path.join(RUNTIME_DIR, name) + if os.path.isfile(src): + shutil.copy(src, os.path.join(tmpdir, name)) + r = subprocess.run(["make", "-C", tmpdir, "DEVICE_SHIM=cuda"], capture_output=True, text=True) if r.returncode != 0: - raise RuntimeError(f"CUDA runtime build failed: {r.stderr}") + raise unittest.SkipTest(f"CUDA runtime build failed: {r.stderr}") + out = os.path.join(tmpdir, "build", "libpascalrt.a") + if not os.path.exists(out): + raise unittest.SkipTest("CUDA runtime build failed: no archive produced") return out @@ -108,13 +125,26 @@ class TestDeviceOrchestrationVectorAddGPU(unittest.TestCase): @classmethod def setUpClass(cls): - cls.runtime_lib = _build_cuda_runtime() + # Build into an isolated temp dir; never touch the shared + # runtime/build/ that every other link test depends on. + cls._runtime_tmp = tempfile.mkdtemp(prefix='pascalrt-cuda-') + try: + cls.runtime_lib = _build_cuda_runtime(cls._runtime_tmp) + except BaseException: + # setUpClass failure (incl. SkipTest) skips tearDownClass, so clean + # the temp dir here rather than leak it. + shutil.rmtree(cls._runtime_tmp, ignore_errors=True) + cls._runtime_tmp = None + raise @classmethod def tearDownClass(cls): - # Restore the default (CPU) shim so other suites/tools see the usual lib. - subprocess.run(["make", "-C", RUNTIME_DIR, "clean"], capture_output=True) - subprocess.run(["make", "-C", RUNTIME_DIR], capture_output=True) + # The only shared state we created is our private temp dir; the source + # tree's runtime/build/ was never touched, so there is nothing to + # restore. 
+ tmp = getattr(cls, '_runtime_tmp', None) + if tmp: + shutil.rmtree(tmp, ignore_errors=True) def test_vector_add_runs_on_gpu(self): files = { diff --git a/tests/support.py b/tests/support.py index 46bf5d6..6296072 100644 --- a/tests/support.py +++ b/tests/support.py @@ -25,12 +25,41 @@ CAN_BUILD_EXE = HAS_LLVMLITE and HAS_CLANG +def _probe_cuda_headers() -> bool: + """True iff ``<cuda.h>`` is findable by clang the way the runtime build looks. + + ``runtime/cuda_launch.c`` does ``#include <cuda.h>`` and the runtime + Makefile compiles it with ``-I$(CUDA_HOME)/include`` (plus clang's default + system search paths). Probe exactly that: a syntax-only compile of a + one-liner ``#include <cuda.h>`` with ``-I$CUDA_HOME/include``. This returns + False on a box that has the NVIDIA driver (``nvidia-smi`` / ``libcuda.so.1``) + but not the CUDA toolkit headers, so the build+run GPU test is skipped at + collection rather than selected and then failing the shim compile. + """ + if not HAS_CLANG: + return False + cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda") + try: + r = subprocess.run( + ["clang", "-x", "c", "-fsyntax-only", "-Wno-unknown-pragmas", + "-I", os.path.join(cuda_home, "include"), "-"], + input="#include <cuda.h>\n", + capture_output=True, text=True, timeout=10) + return r.returncode == 0 + except Exception: + return False + + def _probe_gpu() -> bool: """True iff a real CUDA GPU run is possible here. Requires: an NVIDIA device visible to the driver, the NVPTX backend in this - llvmlite (to emit PTX), clang, and a linkable libcuda. Probed cheaply so the - @requires_gpu tests skip cleanly on CPU-only machines. + llvmlite (to emit PTX), clang, a linkable libcuda, AND the CUDA toolkit + headers (``cuda.h``) to build the CUDA shim. The last check is what skips + a driver-only box (``nvidia-smi`` + ``libcuda.so.1`` but no toolkit): the + test builds+runs the shim, so the headers are a hard prerequisite, not just + the driver. 
Probed cheaply so the @requires_gpu tests skip cleanly on + CPU-only and driver-only machines. """ if not CAN_BUILD_EXE: return False @@ -48,9 +77,20 @@ def _probe_gpu() -> bool: if any(Path(p).exists() for p in ( "/usr/lib/x86_64-linux-gnu/libcuda.so", "/usr/lib/x86_64-linux-gnu/libcuda.so.1")): - return True - cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda") - return Path(cuda_home, "lib64", "stubs", "libcuda.so").exists() + has_libcuda = True + else: + cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda") + has_libcuda = Path(cuda_home, "lib64", "stubs", "libcuda.so").exists() + if not has_libcuda: + return False + # The CUDA shim (runtime/cuda_launch.c) #includes <cuda.h>, built with + # -I$(CUDA_HOME)/include. A box can have the *driver* (nvidia-smi + + # libcuda.so.1) but not the *toolkit* headers, which is enough to run an + # already-built shim but NOT to build it -- and this is a build+run test. + # Probe the header exactly the way the Makefile looks for it so @requires_gpu + # is false on a driver-only box (the test skips at collection) instead of + # being selected and then failing the shim build. + return _probe_cuda_headers() HAS_GPU = _probe_gpu() From 44f33cca5bc95e76c1d529dddd847ef1fb3ce512 Mon Sep 17 00:00:00 2001 From: Dixie Flatline a/k/a McCoy Pauley Date: Sat, 27 Jun 2026 05:25:40 +0000 Subject: [PATCH 09/10] Migrate GPU orchestration test to the --device-backend cuda path The CPU TLS work (commit 7713a86) regressed this test: it compiled the device unit to an x86 dev.ll (for the legacy launch-thunk host path), and that dev.ll now references __pas_tid_x etc. -- TLS globals defined only in cpu_device_shim.c, not the CUDA shim -- so the GPU link failed with 'undefined reference to __pas_tid_x'. 
Migrate the test off the legacy --embed-device-ptx path onto the decoupled cuda backend: compile the host with device_backend='cuda' (emits no launch thunk and no kernel-symbol reference, so no dev.ll is linked), objectify the PTX into a NUL-terminated __pas_device_ptx blob, and link host.ll + blob.o + cuda shim + -lcuda. This is the same 3-command flow the cleanup work established for the examples. Verified (no GPU needed): host .ll has external __pas_device_ptx, no __pas_klaunch, no kernel def; ld -r host.o + blob.o links clean -- no undefined TLS symbol. The actual CUDA shim build + run remain @requires_gpu. Full suite 848 passed, 1 skipped. --- .../test_device_orchestration_gpu.py | 51 ++++++++++++------- 1 file changed, 33 insertions(+), 18 deletions(-) diff --git a/tests/integration/test_device_orchestration_gpu.py b/tests/integration/test_device_orchestration_gpu.py index c775553..b974568 100644 --- a/tests/integration/test_device_orchestration_gpu.py +++ b/tests/integration/test_device_orchestration_gpu.py @@ -10,9 +10,12 @@ Gated by ``@requires_gpu`` so it skips cleanly on CPU-only machines. -The device kernel is compiled to PTX (NVPTX backend) and embedded into the host -compiland via ``--embed-device-ptx``; the host links the CUDA shim archive plus -``-lcuda``. Asserts the result is ``0 3 6 … 21``. +The device kernel is compiled to PTX (NVPTX backend) and packaged as a +NUL-terminated ``__pas_device_ptx`` data object that the host (compiled with +``--device-backend cuda``) references as an external symbol; the host links +that blob + the CUDA shim archive plus ``-lcuda``. The host emits no +in-process launch thunk and no kernel-symbol reference, so no device-unit +``.ll`` is linked. Asserts the result is ``0 3 6 … 21``. 
""" import os @@ -23,7 +26,7 @@ from pascal1981.compile_to_ptx import compile_file_to_ptx from pascal1981.features import resolve_features -from tests.support import (RUNTIME_DIR, compile_pascal_file, requires_gpu, +from tests.support import (RUNTIME_DIR, requires_gpu, temporary_pascal_project) _WIDE = resolve_features(overrides=['wide-integers']) @@ -154,7 +157,6 @@ def test_vector_add_runs_on_gpu(self): } cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda") with temporary_pascal_project(files) as proj: - inc = os.path.join(proj, 'vadd.inc') dev = os.path.join(proj, 'vadd.pas') main = os.path.join(proj, 'main.pas') @@ -165,15 +167,13 @@ def test_vector_add_runs_on_gpu(self): with open(ptx_path, 'w') as f: f.write(ptx) - # 2/3. device unit + interface -> host x86 .ll (the device .ll - # defines the kernel symbol the host launch thunk references; the - # real kernel comes from the embedded PTX at run time). - dev_ll = compile_pascal_file(dev, os.path.join(proj, 'vadd.ll'), - features=_WIDE) - compile_pascal_file(inc, os.path.join(proj, 'vadd-iface.ll'), - features=_WIDE) - - # 4. host program -> .ll, embedding the PTX. + # 2. host program -> .ll with the cuda device backend. This emits + # no in-process launch thunk and no kernel-symbol reference, so the + # host .ll needs no device-unit .ll to link against -- the real + # kernel comes from the PTX loaded at run time. The PTX text is + # packaged as its own NUL-terminated __pas_device_ptx data object + # that the host references as an external symbol (the blob the CUDA + # shim reads as a C-string and cuModuleLoadData's). from pascal1981.codegen import compile_to_llvm from pascal1981.parser import parse_file from pascal1981.type_checker import PascalTypeChecker @@ -183,12 +183,27 @@ def test_vector_add_runs_on_gpu(self): main_ll = os.path.join(proj, 'main.ll') with open(main_ll, 'w') as f: f.write(compile_to_llvm(ast, source_file=main, features=_WIDE, - embed_device_ptx_text=ptx)) - - # 5. 
link host + device .ll + CUDA shim + -lcuda. + device_backend='cuda')) + + # 3. objectify the PTX text into a __pas_device_ptx data blob + # (PTX *text* + trailing NUL; NOT ptxas/cubin output). incbin uses + # the absolute ptx_path so it resolves regardless of assembler CWD. + blob_s = os.path.join(proj, 'dev_ptx_blob.s') + with open(blob_s, 'w') as f: + f.write('\t.section .rodata\n' + '\t.globl __pas_device_ptx\n' + '__pas_device_ptx:\n' + f'\t.incbin "{ptx_path}"\n' + '\t.byte 0\n') + blob_o = os.path.join(proj, 'dev_ptx_blob.o') + asm = subprocess.run(['clang', '-c', blob_s, '-o', blob_o], + capture_output=True, text=True) + self.assertEqual(asm.returncode, 0, msg=asm.stderr) + + # 4. link host .ll + PTX blob + CUDA shim + -lcuda. exe = os.path.join(proj, 'vadd-gpu') link = subprocess.run( - ['clang', main_ll, dev_ll, self.runtime_lib, + ['clang', main_ll, blob_o, self.runtime_lib, '-L' + os.path.join(cuda_home, 'lib64', 'stubs'), '-lcuda', '-o', exe], capture_output=True, text=True) From bddfc95670b16e5f4d39c445d306cbe16b014b5d Mon Sep 17 00:00:00 2001 From: Dixie Flatline a/k/a McCoy Pauley Date: Sat, 27 Jun 2026 05:29:36 +0000 Subject: [PATCH 10/10] README: state the 'make -C runtime' prerequisite for tests The testing section omitted that link-requiring tests link the hardcoded runtime/build/libpascalrt.a and FAIL (not skip) without it. State the prerequisite up front so a clean-tree run isn't a surprise. --- README.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/README.md b/README.md index c4bb80d..1f86a77 100644 --- a/README.md +++ b/README.md @@ -472,10 +472,17 @@ One unified test suite built on `pytest`, with automatic detection of optional d ### Run the entire test suite ```bash +# Build the C runtime archive once (link-requiring tests link against it). 
+make -C runtime # All tests from a source checkout; codegen tests auto-skip if llvmlite/clang are unavailable PYTHONPATH=src python3 -m pytest tests/ -q ``` +`make -C runtime` is **required** for the integration/link tests: they link +`runtime/build/libpascalrt.a` (hardcoded in `tests/support.py`), and without +that archive they *fail* (they do not skip). Parser/typecheck tests need +no dependencies at all; codegen IR-only tests need `llvmlite` but not the archive. + If you installed the package into the active environment, `PYTHONPATH=src` is not needed.
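Because a missing archive shows up as link failures rather than skips, a small pre-flight check can make the cause obvious before `pytest` starts. The helper below is a hypothetical sketch, not something shipped in the repository: the function name `runtime_archive_ready` is an assumption, and only the `runtime/build/libpascalrt.a` path comes from the text above.

```python
from pathlib import Path


def runtime_archive_ready(repo_root):
    """Return True if runtime/build/libpascalrt.a exists under repo_root."""
    archive = Path(repo_root) / "runtime" / "build" / "libpascalrt.a"
    return archive.is_file()
```

Calling `runtime_archive_ready(".")` from the checkout root and running `make -C runtime` when it returns `False` would be one way to use it.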