From 571c9bb965879c7b18bd2d04624ad3a26761c5d6 Mon Sep 17 00:00:00 2001
From: Dixie Flatline a/k/a McCoy Pauley <ubuntu@localhost>
Date: Sat, 27 Jun 2026 03:45:30 +0000
Subject: [PATCH 01/10] Add plan to collapse GPU device-build pipeline to three
 commands

---
 docs/device-build-cleanup-plan.md | 192 ++++++++++++++++++++++++++++++
 1 file changed, 192 insertions(+)
 create mode 100644 docs/device-build-cleanup-plan.md
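
The plan's validation section asks for a grep-able check that a host `.ll`
built with `--device-backend cuda` carries no `__pas_klaunch_` thunk and no
kernel-symbol reference. A minimal sketch of that check follows. The helper
name `cuda_host_ir_problems` and the sample IR strings are hypothetical; only
the `__pas_klaunch_` prefix and the `mandelbrot_f64` kernel name come from the
repository.

```python
import re


def cuda_host_ir_problems(ir_text, kernel_names):
    """Return a list of problems found in host IR text; empty means decoupled.

    Hypothetical helper: it only scans text and does not parse LLVM IR.
    """
    problems = []
    # Any thunk or registry symbol shares this prefix.
    if "__pas_klaunch_" in ir_text:
        problems.append("launch thunk or registry still emitted")
    for name in kernel_names:
        # Match @name or @"name" as a whole LLVM identifier, not as a prefix.
        pattern = r'@"?' + re.escape(name) + r'"?(?![-\w.$])'
        if re.search(pattern, ir_text):
            problems.append("host IR still references kernel symbol @" + name)
    return problems


# Assumed shape of decoupled output: the kernel name survives only as a string.
GOOD_IR = (
    '@.str = private constant [15 x i8] c"mandelbrot_f64\\00"\n'
    "@__pas_device_ptx = external constant [0 x i8]\n"
)
# Assumed shape of the old output: thunk plus a static kernel reference.
BAD_IR = (
    "define internal void @__pas_klaunch_mandelbrot_f64(ptr %a) {\n"
    "  call void @mandelbrot_f64(ptr %a)\n"
    "  ret void\n"
    "}\n"
    "declare void @mandelbrot_f64(ptr)\n"
)
```

The kernel name may still appear as a string constant, since the CUDA shim
looks the kernel up by name, so the check matches only `@`-prefixed global
references.
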
diff --git a/docs/device-build-cleanup-plan.md b/docs/device-build-cleanup-plan.md
new file mode 100644
index 0000000..98dac92
--- /dev/null
+++ b/docs/device-build-cleanup-plan.md
@@ -0,0 +1,192 @@
+# Plan: collapse the GPU device-build pipeline to three commands
+
+Status: PROPOSED. No code changed yet — this is the design.
+
+## 1. Where the bodies are buried (current state)
+
+The end-to-end GPU path for an example (`examples/device_ptx/mandelbrot`,
+`fill_indices`) is driven by `examples/device_ptx/device-example.mk` and the
+hand-written `scripts/build-cuda-host.sh`. For `DEVICE=cuda` it runs **five**
+build steps, one of them a full runtime rebuild, for two source files:
+
+```
+1. dev.ptx   = python -m pascal1981.compile_to_ptx  dev.pas  --cpu sm_86   # device -> PTX
+2. dev.ll    = python -m pascal1981                  dev.pas               # device -> host-x86 .ll
+3. host.ll   = python -m pascal1981 --embed-device-ptx dev.ptx host.pas    # host -> .ll, PTX baked in
+4. runtime   = make -C runtime clean && make -C runtime DEVICE_SHIM=cuda   # wholesale archive rebuild
+5. link      = clang host.ll dev.ll libpascalrt.a -L.../stubs -lcuda -o exe
+```
+
+(`build-cuda-host.sh` has an extra step 3 compiling the interface `.inc` too.)
+
+### The jank, itemized
+
+- **J1 — the device unit is compiled twice, for two unrelated reasons.**
+  Once to NVPTX PTX (the real kernel), once to a *host-x86* `.ll` whose only
+  job is to define the kernel symbol so the link resolves. The second compile
+  produces dead code: it never runs on the GPU.
+
+- **J2 — `dev.ll` exists solely to satisfy a link-time reference from dead
+  code.** Host codegen emits, for every `LAUNCH`, an internal dispatch thunk
+  `__pas_klaunch_<kernel>` that *calls the external kernel symbol*
+  (`codegen/stmts.py::_kernel_launch_thunk`). That thunk is the CPU-device
+  stand-in; on the GPU the CUDA shim dispatches the kernel by name out of the
+  loaded module (`runtime/cuda_launch.c`) and the thunk is never called. But
+  because the thunk *statically references* `@<kernel>`, the linker demands a
+  definition, so we drag in `dev.ll`. The reference is real; the call is dead.
+
+- **J3 — host `.ll` is coupled to the device artifact via `--embed-device-ptx`.**
+  The PTX text is baked into `host.ll` as the `__pas_device_ptx` blob at host
+  compile time (`codegen/stmts.py::_device_ptx_ptr`). So "compile the host"
+  cannot run before "compile the device," and any PTX change forces a host
+  recompile. The host source has nothing to do with the kernel text; this is a
+  packaging concern leaking into the compiler front end.
+
+- **J4 — two CLIs with divergent flags and defaults.** `pascal1981` and
+  `pascal1981.compile_to_ptx` duplicate `--device-triple`, `-f`, `--dialect`,
+  and disagree on defaults (`--cpu sm_70` vs none; device-triple host vs NVPTX).
+  The PTX driver re-implements parse/check/lower glue.
+
+- **J5 — the runtime archive is rebuilt from clean on every GPU build.** The cpu
+  and cuda shims define the same `pas_dev_*` symbols and cannot coexist in one
+  archive, so the Makefile's `runtime-cuda` target does `make clean && make
+  DEVICE_SHIM=cuda` every time. There is no prebuilt-runtime story.
+
+## 2. Target workflow (the goal)
+
+Runtime is prebuilt **once**. Then, per example, exactly three commands:
+
+```bash
+# 1. one command against the device file -> .ptx (+ optional .ll, + embeddable object)
+pascal1981 --target ptx  mandelbrot.pas  mandelbrot.ptx  --sm sm_86  -f wide-integers
+
+# 2. one command against the host file -> .ll  (no PTX coupling)
+pascal1981 --target host --device-backend cuda  mandelbrot_host.pas  mandelbrot_host.ll  -f wide-integers
+
+# 3. one clang command to link the host
+clang mandelbrot_host.ll mandelbrot.ptx.o  libpascalrt_cuda.a  -L$CUDA/lib64/stubs -lcuda -o mandelbrot_host
+```
+
+`ptxas`/`cubin` stays optional (a stronger check, or an `.o` route — see §3.3).
+No second device compile. No `dev.ll`. No runtime rebuild. The host `.ll` is
+independent of the kernel text.
+
+## 3. The changes
+
+### 3.1 Kill `dev.ll` by gating the CPU stand-in machinery (fixes J1, J2)
+
+Root cause is the thunk's static reference to the kernel symbol. Add a host
+compile knob `--device-backend {cpu,cuda}` (plumbed into the codegen
+constructor, `codegen/base.py`). Then in `_codegen_device_orchestration` /
+`_emit_launch_registry`:
+
+- **backend=cuda:** do **not** emit the `__pas_klaunch_<kernel>` thunk or the
+  `__pas_klaunch_registry` table. The GPU launch path only needs
+  `pas_dev_module_load(registry=NULL, ptx)` → `pas_dev_module_get_function(mod,
+  name)` → `pas_dev_launch(entry, geom, argv)`. Pass a null registry pointer;
+  the cuda shim already ignores it (`runtime/cuda_launch.c::pas_dev_module_load`
+  casts `registry` to `(void)`). With no thunk, there is **no reference to the
+  kernel symbol in host `.ll`**, so the link needs no `dev.ll`.
+
+- **backend=cpu:** unchanged — emit thunk + registry exactly as today. The CPU
+  device still resolves and calls the thunk.
+
+Fallback if we want to keep the thunk for symmetry: emit the kernel extern as
+`extern_weak` so an undefined symbol resolves to null instead of forcing a
+definition. Preferred is to drop it entirely on the GPU path — less dead IR.
+
+Net: the GPU build compiles the device unit **once** (to PTX) and never produces
+or links `dev.ll`.
+
+### 3.2 Decouple PTX embedding from host compile (fixes J3)
+
+Stop baking PTX into `host.ll`. Instead, the host references an *external*
+`__pas_device_ptx` symbol, and the PTX blob becomes its own object linked at
+step 3. Two ways to produce that object from `mandelbrot.ptx`; pick one:
+
+- **(a) emit it from the device command.** `--target ptx` also writes
+  `mandelbrot.ptx.o` (or `.s`) defining `const char __pas_device_ptx[]` via an
+  `.incbin`-style stub or `llvm-mc`. Keeps "one command against the device file"
+  literally true and the link a single `clang ... mandelbrot.ptx.o ...`.
+
+- **(b) objectify at link time** with a documented one-liner
+  (`ld -r -b binary`, or a 3-line `.s` using `.incbin "mandelbrot.ptx"`). The
+  clang link line gains one input; the host `.ll` stays pure.
+
+Either way `codegen/stmts.py::_device_ptx_ptr` changes from "embed the text" to
+"declare `external global` `__pas_device_ptx`," and `--embed-device-ptx`
+becomes optional/legacy. Host compile no longer depends on the device artifact.
+
+If we would rather not add a link input, the legacy `--embed-device-ptx` path
+can stay as an opt-in for a strictly two-input link — but the default clean path
+should decouple.
+
+### 3.3 Optional `ptxas` / cubin route
+
+For users who want the assembled artifact: `--target ptx` can additionally drive
+`ptxas -arch=$SM -o mandelbrot.cubin mandelbrot.ptx` when the toolkit is
+present, and §3.2's object can embed the cubin instead of PTX (the cuda shim
+then `cuModuleLoadData`s a cubin, which it already accepts). This is a strict
+add-on; the PTX-text path remains the no-GPU-needed default.
+
+### 3.4 Fold the two CLIs into one (fixes J4)
+
+Make `--target {host,ptx}` a flag on the single `pascal1981` driver
+(`compile_to_llvm.py::main`), sharing feature resolution, dialect, and check
+flags. `--target ptx` sets the device triple to `nvptx64-nvidia-cuda`, honors
+`--sm` (alias the old `--cpu`), and routes through the existing
+`compile_to_ptx.llvm_ir_to_ptx`. Keep `python -m pascal1981.compile_to_ptx` as a
+thin shim that forwards to `--target ptx` for back-compat and existing tests
+(`tests/integration/test_device_mandelbrot_ptx.py`,
+`fill_indices/RUNNING_PTX.md`).
+
+### 3.5 Prebuild both runtime archives once (fixes J5)
+
+Split the shims out of the single archive so both can be prebuilt side by side:
+
+- Build a **core** archive `libpascalrt.a` (everything except the two
+  `cpu_device_shim` / `cuda_launch` shims), plus two tiny shim archives
+  `libpascalrt_dev_cpu.a` and `libpascalrt_dev_cuda.a`. Consumers link core +
+  the chosen shim. No symbol clash, no rebuild.
+
+  Or, simpler for callers: produce two full archives `libpascalrt_cpu.a` and
+  `libpascalrt_cuda.a` in one `make` invocation (two `ar` outputs from one core
+  object set + one shim each). Either removes the `runtime-cuda` clean-rebuild.
+
+The example Makefile then just picks the archive; `runtime-cuda` (the phony that
+does `make clean && make DEVICE_SHIM=cuda`) is deleted.
+
+## 4. Resulting build files
+
+- `device-example.mk` drops the `dev.ll` rule, the `runtime-cuda` phony, and the
+  `--embed-device-ptx` on the host rule. The `cuda` branch becomes:
+  ```make
+  $(BUILD)/dev.ptx:  $(DEVICE_UNIT) ; $(PAS) --target ptx $< $@ --sm $(SM) $(FEATURES)
+  $(BUILD)/dev.o:    $(BUILD)/dev.ptx ; <objectify per 3.2>
+  $(BUILD)/host.ll:  $(HOST_SRC) ; $(PAS) --target host --device-backend cuda $(FEATURES) $< $@
+  $(EXE): $(BUILD)/host.ll $(BUILD)/dev.o ; clang $^ $(RUNTIME_CUDA) -L$(CUDA_HOME)/lib64/stubs -lcuda -o $@
+  ```
+- `scripts/build-cuda-host.sh` collapses from 6 steps to 3 (+ optional ptxas),
+  and stops rebuilding the runtime.
+
+## 5. Migration / compatibility
+
+- Keep `compile_to_ptx` and `--embed-device-ptx` working (deprecated aliases) so
+  existing tests and the `RUNNING_PTX.md` external-launcher recipe keep passing.
+- CPU-device path is untouched by design (backend=cpu keeps thunk+registry); the
+  deferred grid-stride work in `CPU_DEVICE_TODO.md` is orthogonal.
+- The PTX ABI is unchanged — same `.visible .entry`, same parameters — so the
+  drop-in property the mandelbrot README sells (matching `mandelbrot.cu`
+  symbol-for-symbol) is preserved. The validation ladder in
+  `RUNNING_PTX.md`/`cuda-kernel-prescription.md` still applies rung for rung.
+
+## 6. Validation
+
+- Existing PTX-text + `ptxas` checks (mandelbrot/fill READMEs) must still pass on
+  the new `--target ptx` output, byte-comparable to the old `compile_to_ptx`.
+- A new check: host `.ll` built with `--device-backend cuda` has **no undefined
+  kernel symbol** and **no `__pas_klaunch_` thunk** (`grep`-able), proving J1/J2
+  are gone.
+- Link the three-command path on a GPU box and run the existing host programs;
+  output (ASCII mandelbrot, `OK: all 256 indices correct`) must be unchanged.
+

From 47ba728f614d64627ec13a46c2f2de287a3ea6da Mon Sep 17 00:00:00 2001
From: Dixie Flatline a/k/a McCoy Pauley <ubuntu@localhost>
Date: Sat, 27 Jun 2026 03:58:13 +0000
Subject: [PATCH 02/10] Collapse GPU device build to three commands

- Add --device-backend cuda: host emits no launch thunk/registry and no
  kernel-symbol reference, eliminating the dead second device compile (dev.ll).
- Reference the embedded PTX as an external __pas_device_ptx symbol on the cuda
  backend; package PTX text as its own NUL-terminated blob object at link time.
- Unify the PTX CLI into 'pascal1981 --target ptx' (--sm/--emit-llvm); keep
  compile_to_ptx as a deprecated alias.
- Prebuild both runtime archives (libpascalrt_cpu.a / _cuda.a) once; drop the
  clean-rebuild-on-switch dance.
- Update device-example.mk, build-cuda-host.sh, READMEs, and the plan doc;
  add cuda-backend decoupling regression tests.
---
 docs/device-build-cleanup-plan.md          | 51 +++++++++++-----
 examples/device_ptx/device-example.mk      | 40 +++++++------
 examples/device_ptx/fill_indices/README.md |  8 ++-
 examples/device_ptx/mandelbrot/README.md   | 20 +++++--
 runtime/Makefile                           | 69 ++++++++++++++--------
 scripts/build-cuda-host.sh                 | 66 ++++++++++-----------
 src/pascal1981/codegen/__init__.py         | 10 +++-
 src/pascal1981/codegen/base.py             | 10 +++-
 src/pascal1981/codegen/stmts.py            | 33 ++++++++++-
 src/pascal1981/compile_to_llvm.py          | 61 ++++++++++++++++++-
 tests/test_device_ptx_module.py            | 49 ++++++++++++++-
 11 files changed, 309 insertions(+), 108 deletions(-)
 mode change 100644 => 100755 scripts/build-cuda-host.sh
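
The PTX-blob stub that `device-example.mk` and `build-cuda-host.sh` write with
`printf` can also be sketched in Python, which makes the NUL-termination
requirement easy to test. This is an illustration under assumptions, not code
from the series: the helper names `ptx_blob_stub` and `ptx_blob_bytes` are
hypothetical, and the path is assumed to contain no double quotes or
backslashes.

```python
def ptx_blob_stub(ptx_path, symbol="__pas_device_ptx"):
    """Return assembly text defining `symbol` as the PTX text plus a NUL.

    Hypothetical helper mirroring the printf recipe; assumes ptx_path has no
    double quotes or backslashes that would need escaping in the .incbin string.
    """
    return (
        "\t.section .rodata\n"
        "\t.globl " + symbol + "\n"
        + symbol + ":\n"
        '\t.incbin "' + ptx_path + '"\n'
        "\t.byte 0\n"  # the C-string terminator the CUDA shim relies on
    )


def ptx_blob_bytes(ptx_text):
    """Bytes the assembled symbol is expected to hold: PTX text, then one NUL."""
    return ptx_text.encode("utf-8") + b"\x00"
```

Writing the result to `dev_ptx_blob.s` and assembling it with `clang -c` would
then follow the same two steps as the Makefile rules.
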

diff --git a/docs/device-build-cleanup-plan.md b/docs/device-build-cleanup-plan.md
index 98dac92..84e1c51 100644
--- a/docs/device-build-cleanup-plan.md
+++ b/docs/device-build-cleanup-plan.md
@@ -63,8 +63,8 @@ pascal1981 --target ptx  mandelbrot.pas  mandelbrot.ptx  --sm sm_86  -f wide-int
 # 2. one command against the host file -> .ll  (no PTX coupling)
 pascal1981 --target host --device-backend cuda  mandelbrot_host.pas  mandelbrot_host.ll  -f wide-integers
 
-# 3. one clang command to link the host
-clang mandelbrot_host.ll mandelbrot.ptx.o  libpascalrt_cuda.a  -L$CUDA/lib64/stubs -lcuda -o mandelbrot_host
+# 3. one clang command to link the host (after objectifying the PTX blob)
+clang mandelbrot_host.ll mandelbrot_ptx_blob.o  libpascalrt_cuda.a  -L$CUDA/lib64/stubs -lcuda -o mandelbrot_host
 ```
 
 `ptxas`/`cubin` stays optional (a stronger check, or an `.o` route — see §3.3).
@@ -101,21 +101,42 @@ or links `dev.ll`.
 ### 3.2 Decouple PTX embedding from host compile (fixes J3)
 
 Stop baking PTX into `host.ll`. Instead, the host references an *external*
-`__pas_device_ptx` symbol, and the PTX blob becomes its own object linked at
-step 3. Two ways to produce that object from `mandelbrot.ptx`; pick one:
-
-- **(a) emit it from the device command.** `--target ptx` also writes
-  `mandelbrot.ptx.o` (or `.s`) defining `const char __pas_device_ptx[]` via an
-  `.incbin`-style stub or `llvm-mc`. Keeps "one command against the device file"
-  literally true and the link a single `clang ... mandelbrot.ptx.o ...`.
+`__pas_device_ptx` symbol (`codegen/stmts.py::_device_ptx_ptr` now declares
+`@__pas_device_ptx = external constant [0 x i8]` on the cuda backend), and the
+PTX blob becomes its own object linked at step 3.
+
+**What that object is — and is NOT.** It is an object file defining ONE data
+symbol, `__pas_device_ptx`, holding the PTX **text bytes, NUL-terminated**,
+because the CUDA shim reads it as a `const char *` C-string
+(`runtime/cuda_launch.c` checks `ptx[0]=='\0'` then `cuModuleLoadData`s it). It
+is **not** `ptxas`/cubin output. Name it for what it is —
+`mandelbrot_ptx_blob.o` — **never `.ptx.o`**, which invites feeding it to the
+wrong tool. Two correctness traps the naming hid:
+
+1. **NUL termination.** A bare `.incbin "mandelbrot.ptx"` is *not*
+   NUL-terminated; the stub must append a `.byte 0` or the shim reads past the
+   blob.
+2. The object carries no code, just `.rodata`; it is produced by the assembler,
+   not a compiler pass.
+
+The objectifier is a 5-line assembly stub assembled with `clang -c`:
+
+```asm
+        .section .rodata
+        .globl  __pas_device_ptx
+__pas_device_ptx:
+        .incbin "mandelbrot.ptx"
+        .byte 0                  # the C-string NUL the shim requires
+```
 
-- **(b) objectify at link time** with a documented one-liner
-  (`ld -r -b binary`, or a 3-line `.s` using `.incbin "mandelbrot.ptx"`). The
-  clang link line gains one input; the host `.ll` stays pure.
+The example Makefile / `build-cuda-host.sh` generate this stub from `dev.ptx`.
+`--embed-device-ptx` stays as a legacy opt-in (host-embeds, two-input link).
+With the default decoupled path, host compile no longer depends on the device
+artifact.
 
-Either way `codegen/stmts.py::_device_ptx_ptr` changes from "embed the text" to
-"declare `external global` `__pas_device_ptx`," and `--embed-device-ptx`
-becomes optional/legacy. Host compile no longer depends on the device artifact.
+**Verified:** `host.o` built with `--device-backend cuda` shows `U
+__pas_device_ptx` and no `__pas_klaunch_*` / kernel symbol; `ld -r host.o
+mandelbrot_ptx_blob.o` resolves it to a defined `R __pas_device_ptx`.
 
 If we would rather not add a link input, the legacy `--embed-device-ptx` path
 can stay as an opt-in for a strictly two-input link — but the default clean path
diff --git a/examples/device_ptx/device-example.mk b/examples/device_ptx/device-example.mk
index 7d79589..2e9901d 100644
--- a/examples/device_ptx/device-example.mk
+++ b/examples/device_ptx/device-example.mk
@@ -21,6 +21,7 @@ THIS_MK     := $(lastword $(MAKEFILE_LIST))
 REPO        := $(abspath $(dir $(THIS_MK))/../..)
 RUNTIME     := $(REPO)/runtime
 RUNTIME_LIB := $(RUNTIME)/build/libpascalrt.a
+RUNTIME_CUDA:= $(RUNTIME)/build/libpascalrt_cuda.a
 
 PAS := PYTHONPATH=$(REPO)/src python3 -m pascal1981
 PTX := PYTHONPATH=$(REPO)/src python3 -m pascal1981.compile_to_ptx
@@ -40,29 +41,32 @@ $(BUILD):
 	mkdir -p $(BUILD)
 
 ifeq ($(DEVICE),cuda)
-# ---- real GPU: CUDA Driver API shim + embedded PTX (Strategy 1) -------------
-# The device kernel is compiled twice, on purpose:
-#   * to PTX, embedded into the host so the CUDA shim cuModuleLoadData's it;
-#   * to a host .ll, which defines the kernel symbol the host's launch thunk
-#     links against (dead at run time -- the real kernel is the loaded PTX).
+# ---- real GPU: CUDA Driver API shim, three commands ------------------------
+# The device kernel is compiled ONCE, to PTX (the real kernel).  The host is
+# compiled with --device-backend cuda, so it emits no in-process launch thunk
+# and no kernel-symbol reference -- there is no second 'dev.ll' device compile.
+# The PTX text is packaged as its own object (a NUL-terminated __pas_device_ptx
+# byte blob the host references as an external symbol); the CUDA shim
+# cuModuleLoadData's it at run time.  Build the cuda runtime archive once with
+#   make -C runtime cuda
+# (this Makefile does not rebuild it on every example build).
 $(BUILD)/dev.ptx: $(DEVICE_UNIT) | $(BUILD)
-	$(PTX) $< $@ --cpu $(SM) $(FEATURES)
+	$(PAS) --target ptx $< $@ --sm $(SM) $(FEATURES)
 
-$(BUILD)/dev.ll: $(DEVICE_UNIT) | $(BUILD)
-	$(PAS) $(FEATURES) $< $@
+# Objectify the PTX into a single data symbol the host links against.  This is a
+# data blob (PTX *text* + a trailing NUL for the shim's C-string read), NOT
+# ptxas/cubin output -- hence the _blob.o name, never .ptx.o.
+$(BUILD)/dev_ptx_blob.s: $(BUILD)/dev.ptx | $(BUILD)
+	printf '\t.section .rodata\n\t.globl __pas_device_ptx\n__pas_device_ptx:\n\t.incbin "$(BUILD)/dev.ptx"\n\t.byte 0\n' > $@
 
-$(BUILD)/host.ll: $(HOST_SRC) $(BUILD)/dev.ptx | $(BUILD)
-	$(PAS) $(FEATURES) --embed-device-ptx $(BUILD)/dev.ptx $< $@
+$(BUILD)/dev_ptx_blob.o: $(BUILD)/dev_ptx_blob.s
+	clang -c $< -o $@
 
-# The runtime archive must carry the CUDA shim (cuda_launch.c). The cpu and cuda
-# shims define the same symbols, so the archive is rebuilt cleanly for this mode.
-.PHONY: runtime-cuda
-runtime-cuda:
-	$(MAKE) -C $(RUNTIME) clean
-	$(MAKE) -C $(RUNTIME) DEVICE_SHIM=cuda
+$(BUILD)/host.ll: $(HOST_SRC) | $(BUILD)
+	$(PAS) $(FEATURES) --device-backend cuda $< $@
 
-$(EXE): $(BUILD)/host.ll $(BUILD)/dev.ll runtime-cuda
-	clang $(BUILD)/host.ll $(BUILD)/dev.ll $(RUNTIME_LIB) \
+$(EXE): $(BUILD)/host.ll $(BUILD)/dev_ptx_blob.o
+	clang $(BUILD)/host.ll $(BUILD)/dev_ptx_blob.o $(RUNTIME_CUDA) \
 	      -L$(CUDA_HOME)/lib64/stubs -lcuda -o $@
 
 else ifeq ($(DEVICE),cpu)
diff --git a/examples/device_ptx/fill_indices/README.md b/examples/device_ptx/fill_indices/README.md
index 5695f14..e4dca92 100644
--- a/examples/device_ptx/fill_indices/README.md
+++ b/examples/device_ptx/fill_indices/README.md
@@ -106,14 +106,16 @@ From the repository root:
 
 ```bash
 cd examples/device_ptx/fill_indices
-PYTHONPATH=../../../src python3 -m pascal1981.compile_to_ptx \
+PYTHONPATH=../../../src python3 -m pascal1981 --target ptx \
   fill.pas \
   fill.ptx \
   --emit-llvm fill.ll \
-  --cpu sm_70
+  --sm sm_70
 ```
 
-Outputs:
+`--target ptx` on the single `pascal1981` driver replaces the old
+`python -m pascal1981.compile_to_ptx` (still accepted as a deprecated alias;
+`--sm` replaces `--cpu`). Outputs:
 
 ```text
 fill.ll   # intermediate LLVM IR
diff --git a/examples/device_ptx/mandelbrot/README.md b/examples/device_ptx/mandelbrot/README.md
index 8ade651..e7b3f54 100644
--- a/examples/device_ptx/mandelbrot/README.md
+++ b/examples/device_ptx/mandelbrot/README.md
@@ -49,6 +49,14 @@ leaf runtime shim is C. The kernels are unchanged, so the emitted PTX remains th
 drop-in described next. Build rules live in
 [`../device-example.mk`](../device-example.mk).
 
+The GPU build is now three commands (the runtime archive is prebuilt once with
+`make -C runtime cuda`): device unit -> PTX (`--target ptx`); host program ->
+`.ll` (`--device-backend cuda`, which emits no launch thunk and no kernel-symbol
+reference, so there is **no** second device compile); then one `clang` link of
+`host.ll` + the PTX-blob object + `libpascalrt_cuda.a`, with `-lcuda`. The PTX text is
+packaged as its own NUL-terminated `__pas_device_ptx` data object (a `*_blob.o`,
+**not** `ptxas`/cubin output) that the host references as an external symbol.
+
 ## The ABI being matched
 
 From `mandelbrot.cu`:
@@ -71,15 +79,17 @@ parameters are genuinely 32-bit; `mandelbrot_f64` uses `REAL64` (≡ `REAL`, f64
 ## Build the PTX
 
 ```bash
-PYTHONPATH=src python3 -m pascal1981.compile_to_ptx \
+PYTHONPATH=src python3 -m pascal1981 --target ptx \
   examples/device_ptx/mandelbrot/mandelbrot.pas \
   examples/device_ptx/mandelbrot/mandelbrot.ptx \
-  --emit-llvm examples/device_ptx/mandelbrot/mandelbrot.ll \
-  --cpu sm_86
+  --sm sm_86 -f wide-integers
 ```
 
-This needs `llvmlite`/LLVM with the NVPTX backend; it needs **no** NVIDIA device,
-CUDA driver/runtime, `nvcc`, or the Pascal runtime library.
+`--target ptx` on the single `pascal1981` driver replaces the old
+`python -m pascal1981.compile_to_ptx` (still accepted as a deprecated alias;
+`--sm` replaces `--cpu`). It needs `llvmlite`/LLVM with the NVPTX backend; it
+needs **no** NVIDIA device, CUDA driver/runtime, `nvcc`, or the Pascal runtime
+library.
 
 ## Inspect the artifact
 
diff --git a/runtime/Makefile b/runtime/Makefile
index 607bafe..0da10c4 100644
--- a/runtime/Makefile
+++ b/runtime/Makefile
@@ -1,8 +1,23 @@
 # Makefile for the Pascal-1981 C runtime static library.
 #
 # Usage:
-#   make            Build libpascalrt.a in build/
+#   make            Build the CPU-shim archive (libpascalrt_cpu.a) + the
+#                   back-compat alias libpascalrt.a, in build/.  No GPU/CUDA
+#                   headers required.
+#   make cuda       Also build the CUDA-shim archive (libpascalrt_cuda.a).
+#                   Requires the CUDA toolkit headers ($CUDA_HOME/include).
+#   make both       Build both archives.
 #   make clean      Remove build artifacts
+#
+# The cpu and cuda device shims define the SAME pas_dev_* symbols, so they
+# cannot coexist in one archive.  Rather than rebuild-on-switch (the old
+# DEVICE_SHIM clean-rebuild dance), we compile the shared core once and emit one
+# archive per shim.  Consumers pick the archive at link time; the core objects
+# are never recompiled to switch devices.  A consumer linking libpascalrt_cuda.a
+# must add -lcuda on its final link line.
+#
+# DEVICE_SHIM is still accepted for backward compatibility: `make
+# DEVICE_SHIM=cuda` builds the cuda archive as libpascalrt.a, as before.
 
 CC      := clang
 CFLAGS  := -c -O2 -Wall -Wextra
@@ -11,37 +26,46 @@ ARFLAGS := rcs
 
 SRCDIR  := .
 BUILDDIR:= build
-TARGET  := $(BUILDDIR)/libpascalrt.a
+CUDA_HOME ?= /usr/local/cuda
+
+# Shared core: every .c file except the two device shims (built once, reused by
+# both archives).
+CORE_SRCS := $(filter-out $(SRCDIR)/cpu_device_shim.c $(SRCDIR)/cuda_launch.c,$(wildcard $(SRCDIR)/*.c))
+CORE_OBJS := $(patsubst $(SRCDIR)/%.c,$(BUILDDIR)/%.o,$(CORE_SRCS))
+
+CPU_SHIM_OBJ  := $(BUILDDIR)/cpu_device_shim.o
+CUDA_SHIM_OBJ := $(BUILDDIR)/cuda_launch.o
 
-# Device-orchestration shim selector: cpu (default, CPU stand-in, no GPU) or
-# cuda (real CUDA Driver API).  Both shims define the same pas_dev_* symbols, so
-# exactly one must be in the archive.  The cuda shim needs the CUDA headers to
-# compile; consumers linking the cuda archive must add -lcuda on their final
-# link line (or use the scripts/build-cuda-host.sh recipe).
-DEVICE_SHIM ?= cpu
+CPU_LIB   := $(BUILDDIR)/libpascalrt_cpu.a
+CUDA_LIB  := $(BUILDDIR)/libpascalrt_cuda.a
+ALIAS_LIB := $(BUILDDIR)/libpascalrt.a
 
+.PHONY: all cuda both clean cleaner
+
+# ---- Backward-compatible DEVICE_SHIM override ------------------------------
+# `make DEVICE_SHIM=cuda` builds the cuda archive as the legacy libpascalrt.a.
 ifeq ($(DEVICE_SHIM),cuda)
-SHIM_EXCLUDE := cpu_device_shim.c
-CUDA_HOME    ?= /usr/local/cuda
-CFLAGS       += -I$(CUDA_HOME)/include
-else ifeq ($(DEVICE_SHIM),cpu)
-SHIM_EXCLUDE := cuda_launch.c
+all: $(CUDA_LIB)
+	cp $(CUDA_LIB) $(ALIAS_LIB)
 else
-$(error DEVICE_SHIM must be 'cpu' or 'cuda', got '$(DEVICE_SHIM)')
+# Default: CPU archive + the back-compat alias name (libpascalrt.a == cpu).
+all: $(CPU_LIB)
+	cp $(CPU_LIB) $(ALIAS_LIB)
 endif
 
-# Every .c file in this directory is part of the runtime, except the shim that
-# was not selected.
-SRCS    := $(filter-out $(SRCDIR)/$(SHIM_EXCLUDE),$(wildcard $(SRCDIR)/*.c))
-OBJS    := $(patsubst $(SRCDIR)/%.c,$(BUILDDIR)/%.o,$(SRCS))
-
-.PHONY: all clean
+cuda: $(CUDA_LIB)
+both: $(CPU_LIB) $(CUDA_LIB)
 
-all: $(TARGET)
+$(CPU_LIB): $(CORE_OBJS) $(CPU_SHIM_OBJ) | $(BUILDDIR)
+	$(AR) $(ARFLAGS) $@ $^
 
-$(TARGET): $(OBJS) | $(BUILDDIR)
+$(CUDA_LIB): $(CORE_OBJS) $(CUDA_SHIM_OBJ) | $(BUILDDIR)
 	$(AR) $(ARFLAGS) $@ $^
 
+# The cuda shim is the only object that needs the CUDA headers.
+$(CUDA_SHIM_OBJ): $(SRCDIR)/cuda_launch.c pascalrt.h | $(BUILDDIR)
+	$(CC) $(CFLAGS) -I$(CUDA_HOME)/include -o $@ $<
+
 $(BUILDDIR)/%.o: $(SRCDIR)/%.c pascalrt.h | $(BUILDDIR)
 	$(CC) $(CFLAGS) -o $@ $<
 
@@ -52,4 +76,3 @@ clean:
 	rm -rf $(BUILDDIR)
 
 cleaner: clean
-	rm -f $(TARGET)
diff --git a/scripts/build-cuda-host.sh b/scripts/build-cuda-host.sh
old mode 100644
new mode 100755
index 236717e..4545485
--- a/scripts/build-cuda-host.sh
+++ b/scripts/build-cuda-host.sh
@@ -2,68 +2,64 @@
 #
 # End-to-end recipe: compile a Pascal DEVICE UNIT + host PROGRAM and run the
 # kernel on a real GPU through the CUDA Driver API shim (cuda-kernel-prescription
-# §5.2 Strategy 1).  This is the GPU counterpart of the CPU-device stand-in; the
-# Pascal sources are byte-for-byte the same, only the runtime shim differs.
+# §5.2 Strategy 1).  The Pascal sources are byte-for-byte the same as the CPU
+# stand-in; only the runtime shim differs.
 #
-# Pipeline:
-#   1. device unit  -> PTX  (--device-triple nvptx64-nvidia-cuda, via compile_to_ptx)
-#   2. device unit  -> host x86 .ll  (defines the kernel symbol the host launch
-#                                     thunk references; dead code at run time,
-#                                     the real kernel comes from the PTX)
-#   3. interface    -> .ll
-#   4. host program -> .ll, embedding the PTX via --embed-device-ptx
-#   5. link main.ll + device .ll + the CUDA runtime archive + -lcuda
-#   6. run on the GPU
+# Three commands (the runtime archive is prebuilt, not rebuilt here):
+#   1. device unit -> PTX            (pascal1981 --target ptx)
+#   2. host program -> .ll           (pascal1981 --device-backend cuda)
+#   3. objectify the PTX blob + link (clang)
+#
+# The host is compiled with --device-backend cuda, so it emits no in-process
+# launch thunk and no kernel-symbol reference -- there is no second device
+# compile ('dev.ll').  The PTX text is packaged as its own data object, a
+# NUL-terminated __pas_device_ptx blob the host references as an external symbol;
+# the CUDA shim cuModuleLoadData's it at run time.  (This is a data blob, NOT
+# ptxas/cubin output -- hence _blob.o, never .ptx.o.)
 #
 # Usage:
-#   scripts/build-cuda-host.sh DEVICE_UNIT.pas IFACE.inc HOST_MAIN.pas OUT_EXE \
+#   scripts/build-cuda-host.sh DEVICE_UNIT.pas HOST_MAIN.pas OUT_EXE \
 #       [-- extra pascal1981 feature flags, e.g. -f wide-integers]
 #
 # Requirements: an NVIDIA GPU + driver, llvmlite with the NVPTX backend, clang,
-# and the CUDA toolkit headers.  Build the runtime archive with the CUDA shim
-# first:  make -C runtime DEVICE_SHIM=cuda
+# and the CUDA toolkit headers.  Build the cuda runtime archive once first:
+#   make -C runtime cuda
 set -euo pipefail
 
-if [ "$#" -lt 4 ]; then
+if [ "$#" -lt 3 ]; then
     sed -n '2,30p' "$0"
     exit 2
 fi
 
-DEVICE_UNIT=$1; IFACE=$2; HOST_MAIN=$3; OUT_EXE=$4; shift 4
+DEVICE_UNIT=$1; HOST_MAIN=$2; OUT_EXE=$3; shift 3
 PAS_FLAGS=()
 if [ "${1:-}" = "--" ]; then shift; PAS_FLAGS=("$@"); fi
 
 REPO_ROOT=$(cd "$(dirname "$0")/.." && pwd)
 CUDA_HOME=${CUDA_HOME:-/usr/local/cuda}
 SM=${SM:-sm_89}
-RUNTIME_LIB=$REPO_ROOT/runtime/build/libpascalrt.a
+RUNTIME_CUDA=$REPO_ROOT/runtime/build/libpascalrt_cuda.a
 WORK=$(mktemp -d)
 trap 'rm -rf "$WORK"' EXIT
 
 PAS() { PYTHONPATH="$REPO_ROOT/src" python3 -m pascal1981 "$@"; }
-PTX() { PYTHONPATH="$REPO_ROOT/src" python3 -m pascal1981.compile_to_ptx "$@"; }
 
-# Ensure the CUDA shim is in the runtime archive (rebuild if missing/stale).
-if ! ar t "$RUNTIME_LIB" 2>/dev/null | grep -q '^cuda_launch.o$'; then
-    echo ">> building runtime with the CUDA shim (DEVICE_SHIM=cuda)" >&2
-    make -C "$REPO_ROOT/runtime" clean >/dev/null
-    make -C "$REPO_ROOT/runtime" DEVICE_SHIM=cuda >/dev/null
+if [ ! -f "$RUNTIME_CUDA" ]; then
+    echo ">> building the cuda runtime archive (make -C runtime cuda)" >&2
+    make -C "$REPO_ROOT/runtime" cuda >/dev/null
 fi
 
 echo ">> 1. device unit -> PTX" >&2
-PTX "$DEVICE_UNIT" "$WORK/dev.ptx" --cpu "$SM" "${PAS_FLAGS[@]}"
-
-echo ">> 2. device unit -> host .ll (defines the kernel symbol)" >&2
-PAS "${PAS_FLAGS[@]}" "$DEVICE_UNIT" "$WORK/dev.ll" >/dev/null
-
-echo ">> 3. interface -> .ll" >&2
-PAS "${PAS_FLAGS[@]}" "$IFACE" "$WORK/iface.ll" >/dev/null
+PAS --target ptx "$DEVICE_UNIT" "$WORK/dev.ptx" --sm "$SM" "${PAS_FLAGS[@]}" >/dev/null
 
-echo ">> 4. host program -> .ll (embedding PTX)" >&2
-PAS "${PAS_FLAGS[@]}" --embed-device-ptx "$WORK/dev.ptx" "$HOST_MAIN" "$WORK/main.ll" >/dev/null
+echo ">> 2. host program -> .ll (device-backend cuda)" >&2
+PAS "${PAS_FLAGS[@]}" --device-backend cuda "$HOST_MAIN" "$WORK/host.ll" >/dev/null
 
-echo ">> 5. link host + device .ll + CUDA shim" >&2
-clang "$WORK/main.ll" "$WORK/dev.ll" "$RUNTIME_LIB" \
+echo ">> 3. objectify PTX blob + link" >&2
+printf '\t.section .rodata\n\t.globl __pas_device_ptx\n__pas_device_ptx:\n\t.incbin "%s"\n\t.byte 0\n' \
+    "$WORK/dev.ptx" > "$WORK/dev_ptx_blob.s"
+clang -c "$WORK/dev_ptx_blob.s" -o "$WORK/dev_ptx_blob.o"
+clang "$WORK/host.ll" "$WORK/dev_ptx_blob.o" "$RUNTIME_CUDA" \
     -L"$CUDA_HOME/lib64/stubs" -lcuda -o "$OUT_EXE"
 
-echo ">> 6. done: $OUT_EXE" >&2
+echo ">> done: $OUT_EXE" >&2
diff --git a/src/pascal1981/codegen/__init__.py b/src/pascal1981/codegen/__init__.py
index 1dd9372..c841cfe 100644
--- a/src/pascal1981/codegen/__init__.py
+++ b/src/pascal1981/codegen/__init__.py
@@ -48,7 +48,8 @@ def __init__(self,
                  host_triple: str = "x86_64-pc-linux-gnu",
                  is_root_compiland: bool = True,
                  is_device_compiland: bool = False,
-                 embed_device_ptx_text: Optional[str] = None):
+                 embed_device_ptx_text: Optional[str] = None,
+                 device_backend: str = 'cpu'):
         """Initialize Codegen with all mixins."""
         super().__init__(verbose=verbose,
                          source_file=source_file,
@@ -58,7 +59,8 @@ def __init__(self,
                          host_triple=host_triple,
                          is_root_compiland=is_root_compiland,
                          is_device_compiland=is_device_compiland,
-                         embed_device_ptx_text=embed_device_ptx_text)
+                         embed_device_ptx_text=embed_device_ptx_text,
+                         device_backend=device_backend)
 
     # ========================================================================
     # Type System
@@ -87,6 +89,7 @@ def compile_to_llvm(
         device_triple: str = "x86_64-pc-linux-gnu",
         host_triple: str = "x86_64-pc-linux-gnu",
         embed_device_ptx_text: Optional[str] = None,
+        device_backend: str = 'cpu',
         # Legacy compat: force_rangeck=True/False is equivalent to
         # force_flags={'RANGECK': True/False}.
         force_rangeck: Optional[bool] = None) -> str:
@@ -114,7 +117,8 @@ def compile_to_llvm(
                       host_triple=host_triple,
                       is_root_compiland=is_root_compiland,
                       is_device_compiland=is_device_compiland,
-                      embed_device_ptx_text=embed_device_ptx_text)
+                      embed_device_ptx_text=embed_device_ptx_text,
+                      device_backend=device_backend)
     module = codegen.codegen(ast)
     return str(module)
 
diff --git a/src/pascal1981/codegen/base.py b/src/pascal1981/codegen/base.py
index 01c81ee..4514e34 100644
--- a/src/pascal1981/codegen/base.py
+++ b/src/pascal1981/codegen/base.py
@@ -100,7 +100,8 @@ def __init__(self,
                  host_triple: str = "x86_64-pc-linux-gnu",
                  is_root_compiland: bool = True,
                  is_device_compiland: bool = False,
-                 embed_device_ptx_text: Optional[str] = None):
+                 embed_device_ptx_text: Optional[str] = None,
+                 device_backend: str = 'cpu'):
         # Each compilation gets its own LLVM context. Identified struct types
         # (used for named records, so self-referential linked-list nodes can
         # build) are interned by name *within a context*; the default global
@@ -211,6 +212,13 @@ def __init__(self,
         # embedding *mechanism* is always present so the GPU swap is a runtime
         # change, but the CPU-device path never executes the PTX.
         self._embed_device_ptx_text: Optional[str] = embed_device_ptx_text
+        # Host launch backend: 'cpu' (CPU-device stand-in) emits the per-kernel
+        # dispatch thunk + registry that resolves and calls the kernel in-process;
+        # 'cuda' targets the real CUDA Driver API shim, where the kernel is the
+        # loaded PTX module and the host never references the kernel symbol -- so
+        # the thunk/registry (and the dead link-time kernel reference they force,
+        # i.e. the second 'dev.ll' device compile) are suppressed entirely.
+        self.device_backend: str = device_backend
         self._build_extern_factories()
         # INPUT/OUTPUT: only PROGRAM owns the strong definition; MODULE and
         # UNIT compilands emit declare-only (external global) so the linker
diff --git a/src/pascal1981/codegen/stmts.py b/src/pascal1981/codegen/stmts.py
index 988c8f6..0cb3da3 100644
--- a/src/pascal1981/codegen/stmts.py
+++ b/src/pascal1981/codegen/stmts.py
@@ -502,7 +502,15 @@ def _codegen_device_orchestration(self, name: str, args: list) -> None:
         # get_function returns the thunk, and launch calls it; on the GPU the
         # same three calls become cuModuleLoadData(ptx) / cuModuleGetFunction /
         # cuLaunchKernel, with no change here.
-        self._record_launched_kernel(fn.name, self._kernel_launch_thunk(fn))
+        # On the CPU-device backend the launch is dispatched in-process through a
+        # per-kernel thunk recorded in this compiland's registry; that thunk
+        # statically references the kernel symbol, which is what forces the
+        # separate host-ABI device compile (dev.ll) at link time.  On the CUDA
+        # backend the kernel is the loaded PTX module and the shim dispatches it
+        # by name, so we emit neither thunk nor registry -- the host .ll then has
+        # no undefined kernel symbol and needs no dev.ll.
+        if self.device_backend != 'cuda':
+            self._record_launched_kernel(fn.name, self._kernel_launch_thunk(fn))
         module = self.builder.call(
             self.runtime_extern('pas_dev_module_load'),
             [self._launch_registry_ptr(), self._device_ptx_ptr()])
@@ -527,6 +535,12 @@ def _launch_registry_ptr(self) -> ir.Value:
         """
         i8p = ir.IntType(8).as_pointer()
         i64 = ir.IntType(64)
+        # CUDA backend: there is no in-process registry (the kernel is the loaded
+        # PTX module and the shim ignores this argument), so pass a null pointer
+        # rather than referencing an external registry global that nothing
+        # defines -- which would otherwise be an undefined symbol at link.
+        if self.device_backend == 'cuda':
+            return ir.Constant(i8p, None)
         if self._launch_registry_gv is None:
             reg_ty = ir.LiteralStructType([i8p.as_pointer(), i8p.as_pointer(), i64])
             self._launch_registry_gv = ir.GlobalVariable(
@@ -545,7 +559,20 @@ def _device_ptx_ptr(self) -> ir.Value:
         """
         i8 = ir.IntType(8)
         i8p = i8.as_pointer()
+        zero = ir.Constant(ir.IntType(32), 0)
         if self._device_ptx_gv is None:
+            if self.device_backend == 'cuda' and not self._embed_device_ptx_text:
+                # CUDA backend, decoupled packaging: the PTX blob is its own
+                # object (built from the .ptx at link time), referenced here as
+                # an external `const char __pas_device_ptx[]`.  The host .ll no
+                # longer needs the kernel text baked in, so host compile does not
+                # depend on the device artifact.
+                gv = ir.GlobalVariable(self.module, ir.ArrayType(i8, 0),
+                                       name='__pas_device_ptx')
+                gv.global_constant = True
+                gv.linkage = 'external'
+                self._device_ptx_gv = gv
+                return self.builder.bitcast(gv, i8p)
             text = self._embed_device_ptx_text or ''
             data = bytearray(text.encode('utf-8') + b'\0')
             const = ir.Constant(ir.ArrayType(i8, len(data)), data)
@@ -553,7 +580,9 @@ def _device_ptx_ptr(self) -> ir.Value:
             gv.global_constant = True
             gv.initializer = const
             self._device_ptx_gv = gv
-        zero = ir.Constant(ir.IntType(32), 0)
+        if isinstance(self._device_ptx_gv.type.pointee, ir.ArrayType) and \
+                self._device_ptx_gv.type.pointee.count == 0:
+            return self.builder.bitcast(self._device_ptx_gv, i8p)
         return self.builder.gep(self._device_ptx_gv, [zero, zero])
 
     def _emit_launch_registry(self) -> None:
diff --git a/src/pascal1981/compile_to_llvm.py b/src/pascal1981/compile_to_llvm.py
index 8275211..bd75c2f 100644
--- a/src/pascal1981/compile_to_llvm.py
+++ b/src/pascal1981/compile_to_llvm.py
@@ -45,6 +45,28 @@ def main() -> int:
                         help='LLVM target triple for DEVICE MODULE units; e.g. nvptx64-nvidia-cuda or '
                         'amdgcn-amd-amdhsa. Defaults to the host x86 triple (CPU-device: address '
                         'spaces collapse to addrspace 0).')
+    parser.add_argument('--target',
+                        choices=['host', 'ptx'],
+                        default='host',
+                        help='Output target: host LLVM IR (.ll, default) or device NVPTX assembly '
+                        '(.ptx). --target ptx selects the NVPTX device triple and honors --sm; it '
+                        'is the single-CLI replacement for python -m pascal1981.compile_to_ptx.')
+    parser.add_argument('--sm',
+                        default='sm_70',
+                        metavar='ARCH',
+                        help='NVPTX target CPU for --target ptx, e.g. sm_70, sm_86 (default: sm_70).')
+    parser.add_argument('--emit-llvm',
+                        default=None,
+                        metavar='PATH',
+                        help='With --target ptx, also write the intermediate NVPTX LLVM IR to PATH.')
+    parser.add_argument('--device-backend',
+                        choices=['cpu', 'cuda'],
+                        default='cpu',
+                        help='Host launch backend for LAUNCH lowering. cpu (default): emit the '
+                        'in-process dispatch thunk + registry (CPU-device stand-in). cuda: target '
+                        'the CUDA Driver API shim -- the kernel is the loaded PTX module, so no '
+                        'thunk/registry and no dead kernel-symbol reference are emitted (no dev.ll '
+                        'needed at link).')
     parser.add_argument('--embed-device-ptx',
                         default=None,
                         metavar='PTX_FILE',
@@ -84,6 +106,42 @@ def main() -> int:
         print(runtime_lib_path())
         return 0
 
+    if args.target == 'ptx':
+        # Single-CLI device path: parse/check/lower to NVPTX IR, then PTX.
+        from .compile_to_ptx import compile_file_to_ptx
+        try:
+            features = resolve_features(args.dialect, args.feature)
+        except ValueError as exc:
+            parser.error(str(exc))
+        if not args.source_file:
+            parser.error('--target ptx requires a source file')
+        try:
+            device_triple = args.device_triple
+            if device_triple == 'x86_64-pc-linux-gnu':
+                device_triple = 'nvptx64-nvidia-cuda'
+            ptx = compile_file_to_ptx(
+                args.source_file,
+                host_triple=args.host_triple,
+                device_triple=device_triple,
+                cpu=args.sm,
+                features=features,
+                emit_llvm_path=args.emit_llvm,
+            )
+            if args.output_file:
+                with open(args.output_file, 'w') as f:
+                    f.write(ptx)
+                print(f'Wrote {args.output_file}', file=sys.stderr)
+            else:
+                print(ptx)
+            return 0
+        except Exception as exc:
+            print(f'Error: {exc}', file=sys.stderr)
+            if args.verbose:
+                traceback.print_exc()
+            else:
+                print('(re-run with -v for a full traceback)', file=sys.stderr)
+            return 1
+
     if args.list_features:
         for feature in all_features():
             print(f'{feature.name}\tdefault={str(feature.default).lower()}\t{feature.help}')
@@ -164,7 +222,8 @@ def main() -> int:
                              features=features,
                              host_triple=args.host_triple,
                              device_triple=args.device_triple,
-                             embed_device_ptx_text=embed_device_ptx_text)
+                             embed_device_ptx_text=embed_device_ptx_text,
+                             device_backend=args.device_backend)
 
         # Output
         if output_file:
diff --git a/tests/test_device_ptx_module.py b/tests/test_device_ptx_module.py
index 324e09e..bd26c51 100644
--- a/tests/test_device_ptx_module.py
+++ b/tests/test_device_ptx_module.py
@@ -79,7 +79,7 @@
 """
 
 
-def _compile_main_ir(proj_files, *, embed_ptx=None):
+def _compile_main_ir(proj_files, *, embed_ptx=None, device_backend='cpu'):
     """Compile main.pas of a project to host IR, optionally embedding PTX."""
     with temporary_pascal_project(proj_files) as proj:
         main_path = os.path.join(proj, 'main.pas')
@@ -87,7 +87,8 @@ def _compile_main_ir(proj_files, *, embed_ptx=None):
         result = PascalTypeChecker(source_file=main_path, features=_WIDE).check(ast)
         assert result.success, result.errors
         return compile_to_llvm(ast, source_file=main_path, features=_WIDE,
-                               embed_device_ptx_text=embed_ptx)
+                               embed_device_ptx_text=embed_ptx,
+                               device_backend=device_backend)
 
 
 @requires_llvm
@@ -125,6 +126,50 @@ def test_empty_blob_emitted_without_ptx(self):
         self.assertIn('@"__pas_device_ptx" = constant [1 x i8]', ir)
 
 
+@requires_llvm
+class TestCudaBackendDecoupling(unittest.TestCase):
+    """--device-backend cuda removes the CPU stand-in machinery entirely.
+
+    On the CUDA backend the kernel is the loaded PTX module and the shim
+    dispatches it by name, so the host must NOT emit the per-kernel dispatch
+    thunk, the registry, or any reference to the kernel symbol -- those were the
+    only reason the device unit had to be compiled a second time (dev.ll) and
+    linked into the host.  The PTX blob is referenced as an external symbol
+    (its own object at link time), so host compile no longer depends on the
+    device artifact.
+    """
+
+    def test_no_thunk_no_registry_no_kernel_ref(self):
+        ir = _compile_main_ir({'vadd.inc': _IFACE, 'main.pas': _MAIN},
+                              device_backend='cuda')
+        # The three-step driver path is still emitted...
+        self.assertIn('pas_dev_module_load', ir)
+        self.assertIn('pas_dev_module_get_function', ir)
+        self.assertIn('pas_dev_launch', ir)
+        # ...but with none of the CPU stand-in scaffolding.
+        self.assertNotIn('__pas_klaunch', ir)         # no thunk, no registry
+        self.assertNotIn('define i32 @"add"', ir)     # no kernel definition
+        # The kernel symbol is never *referenced* (an unused extern declare is
+        # harmless; a call/thunk would force the dead dev.ll link).
+        self.assertNotIn('call void @"add"', ir)
+
+    def test_ptx_blob_is_external_not_embedded(self):
+        ir = _compile_main_ir({'vadd.inc': _IFACE, 'main.pas': _MAIN},
+                              device_backend='cuda')
+        # Host references the blob as an external symbol; the bytes live in a
+        # separate object built from the .ptx at link time.
+        self.assertIn('@"__pas_device_ptx" = external constant', ir)
+
+    def test_explicit_embed_still_wins_on_cuda_backend(self):
+        # Legacy opt-in: --embed-device-ptx still bakes the bytes in even on the
+        # cuda backend (two-input link), so the old path keeps working.
+        ptx = '.visible .entry add() { ret; }\n'
+        ir = _compile_main_ir({'vadd.inc': _IFACE, 'main.pas': _MAIN},
+                              embed_ptx=ptx, device_backend='cuda')
+        self.assertNotIn('external constant', ir.split('__pas_device_ptx')[1][:40])
+        self.assertIn('visible .entry add', ir)
+
+
 @requires_llvm
 class TestRegistryDedup(unittest.TestCase):
 

From 00073fa7cf73075503397dcb075dc2a54a1d4a06 Mon Sep 17 00:00:00 2001
From: Dixie Flatline a/k/a McCoy Pauley <ubuntu@localhost>
Date: Sat, 27 Jun 2026 04:04:43 +0000
Subject: [PATCH 03/10] Record as-built status in device-build-cleanup plan

---
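Reviewer note: the as-built section below says the `dev.ll` removal was
verified by grep. A rough Python equivalent of that check is sketched here,
using the same three markers the regression test in PATCH 02 asserts on. The
helper name `cuda_host_ir_problems` and the sample IR in the usage note are
illustrative only, not code or output from the tree.

```python
def cuda_host_ir_problems(ir: str, kernel: str) -> list[str]:
    """List the ways host IR text violates the cuda-backend invariants.

    An empty list means: no dispatch thunk/registry, no call to the kernel
    symbol, and the PTX blob declared as an external constant.
    """
    problems = []
    if "__pas_klaunch" in ir:
        problems.append("dispatch thunk or registry present")
    if f'call void @"{kernel}"' in ir:
        problems.append(f"host calls kernel symbol {kernel!r}")
    if '@"__pas_device_ptx" = external constant' not in ir:
        problems.append("PTX blob is not an external symbol")
    return problems
```

Usage would be along the lines of
`cuda_host_ir_problems(open("host.ll").read(), "add")`, with a non-empty
result treated as a failed check.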
 docs/device-build-cleanup-plan.md | 37 ++++++++++++++++++++++++++++++-
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/docs/device-build-cleanup-plan.md b/docs/device-build-cleanup-plan.md
index 84e1c51..cbeabe1 100644
--- a/docs/device-build-cleanup-plan.md
+++ b/docs/device-build-cleanup-plan.md
@@ -1,6 +1,41 @@
 # Plan: collapse the GPU device-build pipeline to three commands
 
-Status: PROPOSED. No code changed yet — this is the design.
+Status: IMPLEMENTED (commit 47ba728), with one deliberately-deferred optional
+item. See §7 for the as-built status against this design.
+
+## 7. As-built status
+
+Landed as planned:
+
+- **§3.1 — `dev.ll` killed.** `--device-backend cuda` suppresses the
+  `__pas_klaunch_*` thunk and registry and passes a null registry pointer; host
+  `.ll` carries no kernel-symbol reference, so no second device compile is
+  linked. Verified by grep + `ld -r` resolution.
+- **§3.2 — PTX decoupled.** Host references an external `__pas_device_ptx`
+  symbol; the PTX text is packaged as a NUL-terminated `*_blob.o` via an
+  `.incbin` assembly stub. `--embed-device-ptx` retained as a legacy opt-in.
+- **§3.5 — both runtime archives prebuilt** (`libpascalrt_{cpu,cuda}.a`, two
+  full archives in one `make`; the "simpler" variant). `runtime-cuda`
+  clean-rebuild phony deleted.
+- **§4 / §5 — build files + migration.** `device-example.mk` and
+  `build-cuda-host.sh` reduced to the three-command flow; `compile_to_ptx`,
+  `--embed-device-ptx`, and the CPU path all still work; PTX ABI unchanged.
+- **§6 — validation (2 of 3 rungs).** New `--target ptx` output is byte-identical
+  to the pre-change tree (diffed against 571c9bb); a regression test pins
+  "no thunk / no kernel ref / external PTX symbol" on the cuda backend.
+
+Deviations / not done:
+
+- **§3.3 (optional `ptxas`/cubin route) — NOT implemented.** Marked optional; no
+  `ptxas` driving or cubin embedding was added. Future add-on.
+- **§3.4 — built in the reverse direction.** Rather than making `compile_to_ptx`
+  forward to `--target ptx`, `--target ptx` calls into
+  `compile_to_ptx.compile_file_to_ptx` and the old CLI is kept intact as the
+  alias. Functionally identical (single driver, shared flags, byte-identical
+  PTX); only the dependency direction differs from the design in §3.4.
+- **§6 on-GPU run — environmentally blocked.** No NVIDIA device/`ptxas` in the
+  dev VM, so the final "link + run on a GPU box" rung and the `ptxas` text
+  checks remain unexecuted here.
 
 ## 1. Where the bodies are buried (current state)
 

From ef1acd3df9e4e86e119de99b6f3715bc500c00bb Mon Sep 17 00:00:00 2001
From: Dixie Flatline a/k/a McCoy Pauley <ubuntu@localhost>
Date: Sat, 27 Jun 2026 04:22:32 +0000
Subject: [PATCH 04/10] Update RUNNING_PTX.md to the unified --target ptx CLI

---
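Reviewer note: for readers who have not seen PATCH 02, the flag behaviour this
README change relies on can be sketched as below. It is a cut-down
stand-in for the real parser in `compile_to_llvm.py`, assuming only the
options quoted in that patch; `build_parser` and `effective_device_triple`
are hypothetical names, and the real driver has more flags.

```python
import argparse

HOST_TRIPLE = "x86_64-pc-linux-gnu"
NVPTX_TRIPLE = "nvptx64-nvidia-cuda"


def build_parser() -> argparse.ArgumentParser:
    """Parser with just the options relevant to --target ptx."""
    parser = argparse.ArgumentParser(prog="pascal1981-sketch")
    parser.add_argument("source_file")
    parser.add_argument("output_file")
    parser.add_argument("--target", choices=["host", "ptx"], default="host")
    parser.add_argument("--sm", default="sm_70", metavar="ARCH")
    parser.add_argument("--device-triple", default=HOST_TRIPLE)
    return parser


def effective_device_triple(args: argparse.Namespace) -> str:
    """With --target ptx, a device triple left at the host default becomes NVPTX."""
    if args.target == "ptx" and args.device_triple == HOST_TRIPLE:
        return NVPTX_TRIPLE
    return args.device_triple
```

An explicitly passed `--device-triple` is left alone; only the untouched
default is swapped, which is why the README command needs no triple flag.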
 examples/device_ptx/fill_indices/RUNNING_PTX.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/examples/device_ptx/fill_indices/RUNNING_PTX.md b/examples/device_ptx/fill_indices/RUNNING_PTX.md
index 65fcc41..64833c0 100644
--- a/examples/device_ptx/fill_indices/RUNNING_PTX.md
+++ b/examples/device_ptx/fill_indices/RUNNING_PTX.md
@@ -33,13 +33,17 @@ not the same as a successful CUDA launch.  The machine knows when you lie.
 From this example directory in the Pascal repository:
 
 ```bash
-PYTHONPATH=../../../src python3 -m pascal1981.compile_to_ptx \
+PYTHONPATH=../../../src python3 -m pascal1981 --target ptx \
   fill.pas \
   fill.ptx \
   --emit-llvm fill.ll \
-  --cpu sm_70
+  --sm sm_70
 ```
 
+(`--target ptx` on the single `pascal1981` driver replaces the old
+`python -m pascal1981.compile_to_ptx`, still accepted as a deprecated alias;
+`--sm` replaces `--cpu`.)
+
 Inspect:
 
 ```bash

From 7713a86313c6c928f7e99fbfe66b5e1ba43d4c1b Mon Sep 17 00:00:00 2001
From: Dixie Flatline a/k/a McCoy Pauley <ubuntu@localhost>
Date: Sat, 27 Jun 2026 04:47:51 +0000
Subject: [PATCH 05/10] CPU device emulates full GPU launch geometry via TLS
 index registers

THREADIDX_*/BLOCKIDX_*/BLOCKDIM_*/GRIDDIM_* on the CPU triple now lower
to loads from _Thread_local globals (__pas_tid_x, __pas_ctaid_x, etc.)
instead of baked-in constants. pas_dev_launch loops over the full
gx*gy*gz x bx*by*bz geometry, setting those registers before each thunk
call -- the same semantics a GPU provides via hardware special registers.
BLOCKDIM_*/GRIDDIM_* default to 1 so direct (non-LAUNCH) calls retain
the old single-thread behaviour.

- codegen/exprs.py: CPU-triple builtins emit TLS loads, not constants
- runtime/cpu_device_shim.c: define 12 TLS vars; loop in pas_dev_launch
- examples/device_ptx/device-example.mk: wire DEVICE=cpu build+link
- tests: update index-intrinsic test; add shim to mandelbrot_x86 link
- CPU_DEVICE_TODO.md: marked done

Verified: fill_indices OK all 256, mandelbrot full image -- no kernel
changes, PTX output unchanged.
---
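Reviewer note: the launch emulation described in the commit message can be
modelled in a few lines of Python. This is a sketch of the idea only, not the
C in `runtime/cpu_device_shim.c`: the register names, the loop nesting order,
and the 4 x 64 geometry in the usage note are assumptions made for
illustration.

```python
import itertools
import threading

# Stand-in for the shim's _Thread_local index registers.
_regs = threading.local()


def launch(kernel, grid, block, *args):
    """Call kernel once per (block, thread) pair, setting the registers first."""
    gx, gy, gz = grid
    bx, by, bz = block
    _regs.nctaid = grid                     # GRIDDIM_*
    _regs.ntid = block                      # BLOCKDIM_*
    for cz, cy, cx in itertools.product(range(gz), range(gy), range(gx)):
        _regs.ctaid = (cx, cy, cz)          # BLOCKIDX_*
        for tz, ty, tx in itertools.product(range(bz), range(by), range(bx)):
            _regs.tid = (tx, ty, tz)        # THREADIDX_*
            kernel(*args)


def fill_kernel(out):
    """One-thread-per-element kernel: each call writes its own global index."""
    i = _regs.ctaid[0] * _regs.ntid[0] + _regs.tid[0]
    out[i] = i
```

Calling `launch(fill_kernel, (4, 1, 1), (64, 1, 1), out)` on a 256-slot list
visits every index once, which is why the kernels themselves did not need a
grid-stride rewrite.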
 examples/device_ptx/CPU_DEVICE_TODO.md        |  25 ++++++++-
 examples/device_ptx/device-example.mk         |  26 +++++----
 examples/device_ptx/fill_indices/fill_host    | Bin 0 -> 18240 bytes
 .../device_ptx/mandelbrot/mandelbrot_host     | Bin 0 -> 22520 bytes
 runtime/cpu_device_shim.c                     |  51 +++++++++++++++---
 src/pascal1981/codegen/exprs.py               |  42 ++++++++++++---
 .../integration/test_device_mandelbrot_x86.py |  10 ++--
 tests/test_device_index_intrinsics.py         |  10 ++--
 8 files changed, 132 insertions(+), 32 deletions(-)
 create mode 100755 examples/device_ptx/fill_indices/fill_host
 create mode 100755 examples/device_ptx/mandelbrot/mandelbrot_host

diff --git a/examples/device_ptx/CPU_DEVICE_TODO.md b/examples/device_ptx/CPU_DEVICE_TODO.md
index 08e2eb6..6fa198f 100644
--- a/examples/device_ptx/CPU_DEVICE_TODO.md
+++ b/examples/device_ptx/CPU_DEVICE_TODO.md
@@ -1,4 +1,4 @@
-# CPU device support for the `device_ptx` examples — future work
+# CPU device support for the `device_ptx` examples — DONE
 
 The Makefiles in `fill_indices/` and `mandelbrot/` accept `DEVICE=cpu` and
 `DEVICE=cuda`. Only `DEVICE=cuda` is wired today; `DEVICE=cpu` prints a pointer
@@ -31,7 +31,28 @@ Both example kernels are one-thread-per-element:
 
 This is a property of the kernels, not the orchestration or the shim.
 
-## What enabling CPU needs: grid-stride kernels
+## How it was fixed (implemented)
+
+Rather than changing the kernels, the CPU shim was made to actually emulate GPU
+execution:
+
+1. **Compiler (`codegen/exprs.py`)**: on the CPU triple, `THREADIDX_*`,
+   `BLOCKIDX_*`, `BLOCKDIM_*`, `GRIDDIM_*` now lower to **loads from
+   thread-local globals** (`__pas_tid_x`, `__pas_ctaid_x`, etc.) instead of
+   baked-in constants. The runtime defines these.
+
+2. **CPU shim (`runtime/cpu_device_shim.c`)**: `pas_dev_launch` now loops over
+   the full launch geometry (`gx*gy*gz` blocks × `bx*by*bz` threads), setting
+   the TLS index registers before each thunk call. `BLOCKDIM_*`/`GRIDDIM_*`
+   default to 1 so direct (non-LAUNCH) kernel calls still work.
+
+3. **Makefile (`device-example.mk`)**: the `DEVICE=cpu` branch, previously a
+   stub, now builds and links `dev.ll` + `host.ll` against `libpascalrt_cpu.a`.
+
+The kernels are unchanged. `make DEVICE=cpu run` now produces correct output for
+both `fill_indices` (all 256 indices correct) and `mandelbrot` (full image).
+
+## What was previously needed (now moot): grid-stride kernels
 
 Make each kernel iterate its whole index space with a grid-stride loop instead of
 handling a single element. For a 1-D kernel:
diff --git a/examples/device_ptx/device-example.mk b/examples/device_ptx/device-example.mk
index 2e9901d..6d071b6 100644
--- a/examples/device_ptx/device-example.mk
+++ b/examples/device_ptx/device-example.mk
@@ -70,15 +70,23 @@ $(EXE): $(BUILD)/host.ll $(BUILD)/dev_ptx_blob.o
 	      -L$(CUDA_HOME)/lib64/stubs -lcuda -o $@
 
 else ifeq ($(DEVICE),cpu)
-# ---- CPU device: FUTURE WORK (see CPU_DEVICE_TODO.md) -----------------------
-# The host orchestration already works on the CPU shim; what's missing is kernel
-# coverage. The CPU device runs a single-thread grid, so a one-thread-per-element
-# kernel computes only element 0. Enabling this is a kernel change, deferred.
-$(EXE):
-	@echo "DEVICE=cpu is not yet wired for this example."                    >&2
-	@echo "See examples/device_ptx/CPU_DEVICE_TODO.md for why and what it"   >&2
-	@echo "needs. For now, build and run on a GPU with:  make DEVICE=cuda"   >&2
-	@false
+# ---- CPU device: full-grid emulation via thread-local index registers -------
+# The CPU shim now emulates a GPU launch: pas_dev_launch loops over the full
+# gx*gy*gz x bx*by*bz grid, setting thread-local __pas_tid_*/__pas_ctaid_*
+# globals before each thunk call so the kernel sees the correct indices.
+# The device unit compiles to the host triple (no PTX), and links alongside
+# the host .ll against libpascalrt_cpu.a. No GPU or CUDA toolkit required.
+#
+# Build the cpu runtime archive once with:  make -C runtime
+# (this Makefile does not rebuild it on every example build).
+$(BUILD)/dev.ll: $(DEVICE_UNIT) | $(BUILD)
+	$(PAS) $(FEATURES) $< $@
+
+$(BUILD)/host.ll: $(HOST_SRC) | $(BUILD)
+	$(PAS) $(FEATURES) $< $@
+
+$(EXE): $(BUILD)/host.ll $(BUILD)/dev.ll
+	clang $(BUILD)/host.ll $(BUILD)/dev.ll $(RUNTIME_LIB) -lm -o $@
 
 else
 $(error DEVICE must be 'cpu' or 'cuda', got '$(DEVICE)')
diff --git a/examples/device_ptx/fill_indices/fill_host b/examples/device_ptx/fill_indices/fill_host
new file mode 100755
index 0000000000000000000000000000000000000000..e56b03f3085a78c87f086ee0bf74c8aafcf89ea9
GIT binary patch
literal 18240
zcmeHPeQ;dWb-ycr31Vy4V1xJyFV4y}1}|EFRkl&swYBD{NFa=zzz|rk_Cs1&X_wu%
zMmDYykwvvx6q}9-o^${=$pFJJiJQ8_5Gd;pEI|xoQ<AttrZof-sWQ|wK!Ad`zjN<B
ztJOnlJLz=VKXhl_JNNwVIp>~x?tSmwci-N<C(yLE!r>5{riiZz;tDQ>_;f+xYdQns
z6LrFg^E~ktF&+F=iK%j*A}CeKoHCYHYMfxwdrFtv>BU-(DSJqi^d?HXbp=d?ZU^a2
zBPGq->4vkFMW$R{s?Sj6!bT14KhttOicR|>rPp3RmJU@YzIukDtVb~E?bmwywH{NU
z^_X%zDJFFEX@Bh0jWSA1NtO~(I!?|b?u*ors3<ktUZ41?7O+vf9@j_BrTj#3sn+Id
zHD$dU-?1WpjGrx9Z;zo2+Nq`MVaoO006pr5|8C-8wNBSJQTul5{$uLU6;Qo%a8UPM
zbKM59a7W_vZx@!nvgWR5_f@Uh_3F{Kzy0jWM7(wR(kl~@MTvN4wrf#W-SS1tmwGZu
z&tf?oW<i(iDZlSUP6)y%9gn`+p>pG&*8)CgoT8g2z;{f5Q~d@Gd)fkEJo^t$fcH&+
zUjp2~VNcZB@$9dg0Kawu{0iU(4tu%-fWAb=76lvEHwPopbhIs=v7+hq&1(|L&glA3
zYa%Lw!M2WMXE0-h(pE4ir0BSUv2-*lVzERv(=Iwfi9|9CUG#QX>vn|Voj}4}p<pcD
z8A`;rL`5nc@3dl~Bia#8Z5A0T9qvepP-`-6LDEWOf^AVN7>Y#F0){$hlq{!QSm(wY
zMAtMmu3i&d>{%i>=UrOLE*Z@(_AH|yRA98yaIU}=O>`R0il=et5c`4Abz$H*Hy+1O
zoGM;Ho{|H)KLX|JX4eDX#ju?wex<Q}lAj{p(tP&9_*qobyWlBx>-!?#kM_(~c%Kd5
zsPUaP{0YtPw&8m<&*g0Y0Ylm6a>h@er*PA!DC5lUzgppRO;T()70PhF=TO;T8P2u{
zKURj%BtWMVWjF#nntUot_glv(Dk{rx{iu+ot}<LcC$&y>8LsXts=z42`4L7X3(9bG
zSSi((;b)gnA?nKTFO}hw*~th@Mqn}mlM$GVz+?pe_ag9n=a)Y<drnoF`DxEwDTLWu
zuqukj%$~!Qhh(M2WiJCR)*Qvzd8rRM;@hcg_+$~O<{<Gj#TY&&`Dclzsm^dg@;@P-
zrW(V0CI2MxH02rIBl+(WPg9QJos$0^@if&K?w0&t5>Hc%;gsa(fsgG3FBAFieYY`p
z{5CW9mf7?6>6Y~YZ^8SL*<XJg0>!!i;1S}3n8$hPb}0}1W*9L0r>!Hh;$v$rn*KC|
zsOe`FPdYE{Cc#6RLjE!-zl`cNFFRxAPMZhbyV^YP@f6c>#5{h+ngav9+CXLTWXySK
z!>B&_*IoY(ibeMF4Q5Y$D`m~xJJw7yUw;J5&}C@WPz0wV)1C+KxD|Dm>L>l-yHKe8
z*#K4Z-c!e5WFRm|VIK$_RfI5buM;u*0|RC*aKao2oRs`PV2J2I;4}#z34DO;$Fih<
z+D)j{>~9`4bCm;uK4t8FMP!@@0=txC;4ffvH*D^a{6OFdq62}w%H}>~Y0gj4ovCd0
z!RCH?;m-xSy@l<CtT7_;f$muD0AxP(q9=2e$h}9o0ceg?c?-FLq3LH3lE41ESVY-t
zRF;Qe?-qn%e-To--mS=b3ws^g3(nrBQ2X%T;{@TyA#CP)O$yjg1uQ*>y3O1{GcPf-
zf7abnqQ3<aW^T5bUt#tyYg>fD4e^6R50iNQK`Px(je@O>l&3i~^6gUVO*1!W_7qM7
zpdo4I4x9OvX70CU?%&LwgMeQ*bK5{}!11Qpa{%P%0KE{wgh|bP2s3yn-$cs&^=B;v
z*S~BBz@ZLV|KG_Hc;Bg5a-k`&Yac*cs9OIs(!rinuBP0}O}UrMfvIlga)Y~aXdh|z
z9CpEe4)3(6^jozc`C6S*&3zTuC4e?rR5$d`ij9Dky@8_joF9Hjb?h1f@|Cfk0}d1~
zBQ94tydE6}cD^pm^{<78v7S$a^Wj45-cO3c`N#od@CiruPevMIdD)Zse(J>7iotD1
zhF3#fcKoc*(D6_FJMg?T;eD;=U#OlJFNdSf7hcNE>62H36g~;Lp}TQ{z*D$7)hPKI
ztm;wXLspHF?Cm345n~TtF%S_S$X+qjqEzoBRT_N5CsY$RN?Y7Sx=)~qBi}<UYjsJk
z|EdYv=N`ER?IR!Z{Zzed-^h09Z!P)z^G{{tNFN5@rM^eI@>BaTHh+hU#`(foD+aT!
zk;|mY=a3%SjT2;FK`Aw{SirTg68g@ag*#`AR1Y^&J^6luAoduE!QUZi;5jnz4P~H@
zN_y&MY+H#KG~g`^F=$dncd8;`-81}s6@*)4Z5)JVZS)yb;MXO&{yM67oKCJ7SqOiq
z6yj%d6PKci^@DgReAYk}sQcI_#p1{=>E>K=b1!&w(o+yYFn%IQ{d1&1*A5gOg@CtU
z=1$#u2y#zKfoV_E#rpyV`m8+wn)+uwq7)wo@yiPrno+9iFz%}@X6|D%kH_5#Gw-X+
z*ACH3=S_!j)838@D(*Q|ZRW0W=PY++#LQ2vkzjxK5E_#gxdykZ9Q}aNndRs)LZP2;
za?ixQCl96kInE!y>==apoE4UP&YzzH({7Z{$**_MF~c;j(XgJw)rUqORq{Dv_7<Hx
ze)lo88;=Dao>E2Uu7cV7k#pC9$^&NaYtBasm4nXh`yfKkjfu*4>GI)K$izk~K{yV@
z%J>cbZ}`9N-{4<=W6M}m+2+RFnWlW3tGV~>ZT`mGADZ(X?;GBCN9IDk%wF-u#ucZm
zd5!t4?~*gtdri4BL(@@dZz20uV}30{fAP>rh4&3JcY>xFxA?#5ztw-6KS+-~pBSwl
zJyF+`YjPVVP7BQ39Jh&1agkTs@4<Lz&K)rk!BnInx3q>rZX(b|BsaC@buf|K%$gG#
zbP@P#$xL-q{-iEgK+N03^cI}oSqP(M?(?SnR(F><fSV|F{m6W3URnE4Wu;j>Ku%Sf
ztKgY)`}=a-T;&$q-Xy(hsds^z2at=^TE5y`HPdZ9Z}x0;yTrD4yoC|hPI9`|EEddF
z7$w$m6C<RSyrPjsW!1zWdXlLm26xCJnMjFk2Ls->N6sJPUt{rb)2fI&Wj#hRhS>IA
z0DW|BX?!;pUu;^Ha2wVyn|nXn_P+NG*}T#AMRFI{&{#)Quc$etGVYqw$Ta1O#b21k
zLyar`Df^EjKSkfwmGxa|9$A_>QWPfBWCSK7Fd2c#2>gGIfMd*83Cs`Ui9|5o8HtCZ
z8S=TDUuneR>5OI68TE#!$%vXrO<0^QuD^Drfp1oZY@(rcjc_uZj)pDBK`e8&SnF?W
zT4@L)8N(@>h;~Fft&DM3I@#H#Wt3cbg|O(jXv!-3HcQ_&y8pFUq&H3%9!JMO4bYv$
zzmLfyXv;^%;!HfmQ=kh##UF}AAE*m-J!mZ`tyHvtZUx;4x)U@7`a{supr=47(YIJi
zjxFniqs!&EXy%Mck|mselh6lW=956f3>OhMj-@z`y{CK<Rj##F)qm#v(p{C^;_3@m
zEnMQBM{=b9O&oR5mlF&EkqjIbQM9R^Rzv(%uKOz1%$h!B$~08%pE=znXkxJ&$5kH|
zi|2!q%!Vpuh3xLb;X~QIRMt>sRBWxPB<GIe*bf{pm@*M|630GZmq<i?hx-8ef(@%g
zn~wq`mnf0$#lQxE;T2P+2Apew9R{|Q5csvas(N*mvASx(>Z)3QRh_@e*E8e(>HX8b
zGj+$5`zk781l7u72tm~4uc|bm+66nG!H#_U{-T|Rs#Lk1DUUkFGekb<L_U(w<Uf5^
zujm#9!cdJ&TuwZdQ=-KxTELo2lM$GVz+?m_BQP0($p}nFU@`)e5%^;xK=aG-4*uO5
z|4xm6cgDXn<9X@>hN_u=Z^rl!wL{GR$j0y2JkS042Y2*7MajHS(F+tQ{jTmJ&GS2o
z78CsYI(i49#N{8JDJF>@q{%Z9zc2W7hJQPDT9@PNgG`@jO3Uz+<l-slT*-}|w3-;w
zzjx#Lc|nWN`wyiViU|3Bg!~B&Fuq?q!hRb%aQIjt6W5PVHL^Vbr6@j7N&Y6Sj|EZ5
zvmbbEll)3uj_(bU=X$mI|1(nh{kYA8<@9odv{uuvYZ}vZlcx7+`dv*Q(^P3=9Qjo4
znl)=y8VfeGW;?B{;a%og?5SPkU7nS^_cDzLT^>$^I@^p*(R3!B>@*g8Jzh_pv7jW)
zw#TE?ZBxbj^fh=)qJDIUa|Qn{-O!~##+Cmv@CvLhoVK-x^awX*BU6@^AWt}7uUyoK
zj6+-^T6_xO>yh3l9T<&#-KbI_=slV2@b%AlpOinZbR9DOI5Ok4_s77c|N3{RJnmnf
zK>qjy_(bvf4dfkSwm2bIFTj+g@%*80Ar5hYu#dM-rJhUdG0F@eI}3qwh>J?sHJ5%B
z_&F3G-5|byd4bQy?voa6hr8!bHJ;Kq-+$_*9kx13uhaH->I+`3H=y`dZO71d_`VpH
z_$6YaPubx6M_k%bSo!(B3Gm&(jdDSqe-t+o11n2D-`K)^ke`il)vj@__h-OekmvEn
zS_grjQ&CmAuQNUf{5-5au|BPPQ2GVQtL&Kbkd|kV<v*18+%N1SnF9aG&y-JVXocs&
zQa`$MKi;d0XF~ow+~uzrD=sbsPV#%TJg-kJl=iCy|BjpQ(|%p=)&^w=tV~VFP<wgY
zu>ZGcoSzqr-!APqh09R-{CrFUA1^+6X=kon3nUG;^N_T2iP&QE=V9Q~Up}9*&*vwB
zk7xgRX{TE7Jb?S{=Mr~{kfAIveq7^MX`ILZuYhYvDd_WWNxtIZ(tVo?e-C^-f2QF9
zOxGP7<Mz(c_!?aaKSwXrcwFQB95ghZ)i~dF=K~+F-W3zzO~73kZ{0f1?Eei?9`|!S
zemD+4knNgP+SfK;E8GV8*_d_kyo=*`1mm2>bC<0jZ`bv<__#(zH|cu!>Kl?=!$r~E
z(mt*WZHMjO1DwVu$C2s%+Kx*PaJileJ5NaYxnhB?mgS$8_MPGqLlwYnnFh3djxWpq
zlH}3vcK_c3ZfM<d@|m<Vw=^$fJC(@L^~*Yp&q0R9$pgB8ucJ$W({+8!c3pcl&K0no
zfVQ(j+kqQ0T?>4k&X<!I*{Q&+k{hfG6WWg7PymL3Olghx8j9!pgt#%Bw!EGs_Rxe{
z<3TIbCIDotEPewJ7Nh%of>uW`Ognutkc=dQZHZ)SC=rZU$#f<d%65rxvLls<TG5E7
zc6<ri$rBHT(&^CVAijm9H;Y(0)DaCvvK<|pp;E@-gG?NIdCElFlNrmCNI)zV$^`F7
zgtDFCc33R?coT_kB8hM+i|t99@I5D(X^(fH5y9Zvb^hi+FmPQ1ZAxmm`8t1d;~KzN
z{|*LBU1~P06Txemu3znM3SPf<?Tvx;!S(*tO@W}G8euW0_9<cGQE&-uKMF2G1ENtM
zZylmdLM*EMlTuz0?@VQBkCJVFQ%MmUjpR0_5<r&zvu#$%tR*c*BOxm!WL$M%I<0sl
zxJ7T38f#AmPL_9R;jk5gbeEvm23xbSF))niW`~oh%|VL>NO|bhwk~vHM?BLJvcm0h
zV30&7+f@ph-6HPjq~7+zPWfi7l9Nkx1=NDgqII*SMjBeE_JrvjUnO}8(jD-EhEhD5
zf#Fz(6<jLMi(zZ*PPAYolMJ?pIwRPnrgo5(3}MSxu<?3S5{Y*Pvzcf_cH3fkUMkPa
z^&Yj73e~N4uZ@8)tRtnc$t`Z=AT^uZ3kF*=8Qm7SzYSwbw^??}=8|ehGLlV1<$ksp
z#<_);1udvTOIM5T9UA5-i+fbZn#!>+j|V($*c*2jp0Yv#X9H^Zu-9;U%pi!RlN}{Q
zJDEs^A~NRG!1DciV<STsv=mz+&)81q$!zYhLam@yTG4jSbtbK-r>!&VY0bv5`7<67
z9&G)Krc*+q?IG+M^h7py!h)h!T9xQGt1(OvW$9=lL<%~WN?2qO9(j=Uv?Y<TqFp%C
zEeb+Od0}~?xWmX>R68~mmN?-l`Iky3Q_-}wSwsHSjf?08>amo*g$`F0se75S&v}v|
z&rqFgMW`blhUZC(oB~4EwntjPt(CqWs~h+K@J#C*G&kYr!bJO~xf(n+#w7d}QqVJ(
z^`FuYDyF<n!b<#nW_~#`_*|W2eO^Cd>ehl>FV7!X--isXov=QyuP|-V71BJ2vaHYZ
zXA>B?%=)}u!<5%&AR?39{}!OM7Q*_x{=t;n&;GL<(_5iWYb(t2`U=yHvR1X8L^ABZ
zhD8Dyvd#LuzQfee4s-pi&+Xr&^;c_yyk5k#pmEZtv0~Tn21ZrrWuMXchn|P<IvPY|
zVn3L_8<}!_L+df!$%2~d;?gwdKAZktEx?rhWi!k(-C@%&Xnm%AJfXMM&-Na)>GOIb
zQ=b2^yuJN@rS-Z0grOxheMGZNxc$uk02zuO>+dmC2~+!B!e0O1*z~KvqV$;_CgWv^
z<Hpd>kSV`^*#AetU_njo`mfsbdEJVsUEl8iuc1#aPSFMJ^<a?r{|9)SD|GZdmHFQw
zLq4%S@2l=!q$K_;{b7hzk-icyty{6ay}isc{SbP1t1RjBI$58tpXIdq()GtU{dI|K
zzZT&2ft^}^0V`^1A3s%4rfZk|=XERo{|z2k{2s~mvYl^3x4eGd@4f#@RiV9p)@ORY
z&3|6s9@6?ZX}_7U9#g}n@AD~1U#+6-r(K`<`Oqg@+<ty9Y|;9)wgG3ice$=#i?dn5
z`pqRoWgsq7Ipu3t7fG^58%*hj8bIgb;CA!4q}vnwj%+EpR;X;3&3~?!1?hEtJbfOA
KOKl1^R{RHBb&-4k

literal 0
HcmV?d00001

diff --git a/examples/device_ptx/mandelbrot/mandelbrot_host b/examples/device_ptx/mandelbrot/mandelbrot_host
new file mode 100755
index 0000000000000000000000000000000000000000..3cfe464b4da21e488c5db43725b7e7117b5f5acd
GIT binary patch
literal 22520
zcmeHPe{@vUoxhVGFk;9I3K|h~N?&NS5HtKTh|~!r@X`(#HE3v&VaUvoER)PMnbANI
zV?&hb5Tz})Z0oviwbiqWE9<c}+FFMnl4#v7RNJL%*QHkXBxle@po%VK_Vc~>ekYUH
zr0wZBd-jj!o-^-zzrVltd%y3!`|ixVyKB5F@@zK2Enj>_5SDT(ibpd#ZqywBkEj$5
z9A}A7hzY<8Bu()>3V^GcnLiTRHJ!+$w@>q}aE6v+$Pp4Gy|LO}?SP@s;~>34Qqs5;
z7GA0>GGu<%Uxnfe3*4aNXPB!;xoKLe^qN+UgadgBKb};O^@vP*y;`qV>oF8sk0Ixi
zazdY-Ivy)Dz@x^HWT_FPk83_*qd-L#E)A{zJX+5RS)L*Ln+QA9A1f~<;8Ei=v)(Q5
z!~%awoON378Ew!Cw<&jmA^W=tdNdDzZgXnEo3+2O#<y3`A494yYUFn+J~VwRSFaJ5
zJ-F%3Ph2|nCeM{0%{c$S1J_PkzOL??aP!7R3$6(V=7yVF;~jH5Di_UNw4l5_Qa(?X
z!(`}^Jr(yo=m|kIYR7^<ZBf0^k1IgFXcX4kG3XDBK_~w%e5`FN2&375a18p+G3YZv
zcj04gG}_VZ-#iBWb7RowfbPP_+BSnQ%St5NS6jcz7YMcmH#N7%f^GGymWLy)!FvD3
za8USsn_42RzV?{EE#~tn(NP6MZNZ=jg~IXnCeh*#ha-*9#cVff-4=gyD=3W}eqX4$
z)gNx&8WhpC=GIt9v;<okqgzCKtgW#nD*PKGZ81p3!tK6I!I;k<2($?pYN0AwM7>7c
z8?A^|dTW<0_suJxFSzGkknPSN?#?S;NJGy<wNg3fp?V1x;;3+{L!0OYjm`@fJ{LDP
zqZA9ozoJjg2Hk%K<nw0FSHFR>EfoK*Y0pSJU%aL9`1Ixr$kV&PsdcVWkgrFNOi}nw
z3w?v8@3zpN(D+^peZR(;&-NeH*AM11eZv*XuUR)T9>#n1eUr{f$_=+v4xO(##Ou$Y
zvk#&l&7l_&LAT>M^vO&lsN<pQtu2Rc&!OvEh2%PO=<+_PbxLyR>bjyBt{gh~;8vDH
z$Ao2DMGk#ZmMTPL4*kL$`gnId1LGMO&%k&F#xpRUf&aS<oOWFCvDtOfZYB$#yFm!E
zI~B{z95uTR+Yd>nnT4-_oGE=3N5_>O^ayVv-r$K0TIoT;X^Am-RN~JQPD`D^l*FGV
zoR%7c`z8L5gwv8|@DYiBhj3bQ4DOcrw+N@D#$cz!zeYGMF$SX&p9MU$8@Ox~zx&<V
z#If~e;w`i5?V-B*8h6V5ve~or7z8rYUi*v?XF}zUD|bkF(AUC%*;9BkfyHNI(=hZO
zK!}EZLFR<x%1#nIq#@+5lJX15uX)vZGcjZyc<%=Dz}bA$cEmh(J~kBwy0rm&=0wPG
zW%aN=#n-v?9k50G(`(GGr5mYhCf<n^naQO`fDBxPVGRUuI8yimaNF(ho%K)pgLi_d
z<5>e$^PZC~B)qStpVHn}^Qr<zfdG4*fZ0>iXC`Wnn|(DWB;HpuK(Mc7h=h;SoI&@j
z<n$Caz^mD_s^3i5`)YP7W4jcPdG4#(qa<tYgU!9L`G~~(YMvn2SF>N)d<I=w^HX+<
zl+B&63GCq`AU;PdJOiW@?Q1!-p&JzK-W=MQ7Zq)5E{)`RcVc0jsOfa4cBJAZX_2hy
z+?*;%bZ?_9?M{91;=yZCe(qGFZ{WwjM{3vNkd)j|;sAoV5C%gDyW?2_2``=x%${lQ
zyabTQ>!mt06Wtz|>ZKY+-=sbb5d9|f0qTh4RHO}x)F8pH%|uUy?Dd;nsT4%1X3fN5
zGkKkv_-`|D%IrEgWF}rW6Wal9f<z9nZ^h7RY4j#c#Yj4BgF*hNmENJ~$1#NKFoaVM
zXWl8t=SP}7?U=HHoSlHs%s(YG03q?GgoezX$-g-)-Px48Uz=SA;Nw;JvY839((Ebx
z_;Q)DN=iV_Qe!c&o`q*X9H>O<%;fha2UD<Ep_Fqo`5>j+O!iQUy~%D$t2fz0N%kfm
zRO#<ClPaxUC!OBJRKrYE8}@;AveI?f3154ckgviih|o+{pm!O1G~EXWdK{AV0WNN4
zvU1>Ih5i<3s1K=37t6A_?|)D>vv7d!Ys5>UV7jkH5UTdN=b-CO?YHenIl8|Cy}_qx
z9(opDg5i04CSOiN?<sUiu=G;^&(p*{N3&JcE9Vli_Z1jw`VjtpF>vr>_>%)D401e!
zYfzA|v|0vX4|N@|0rL=+un(?B6hPgYndm74A=LGeaQs~=bk9fk*&PobaP>c7i+?tK
zcT&cj=}&YU5Tet>nJ$%s{TBQVEW{y{ltWDRkVI&4|Mo`*>Cz;H;w0phLedM|&btnw
zIbJ-LEZ7b3vO2e(JSam0`%&~9FN&_S9d|j?*GMB3WaK9w$)LB9Akw-ym5<cU(lKpJ
zZy%hDKJ*)v{wc@RDOC03z-3f!>*@UcA~>fUQ>fVKGx;XTT1%|;7S?CL3OULTIPOc}
z3?BR@4xwZZc^gjCH!tHfIn&E!oI4TXz$bA)Mo1RJ>xXgw7jUgB*oir$Df!`n+yS`s
z0G#RB(nJlJIB`~*AYLew!Z5GMO3jf{SCZ88QmTh|e=Mle5F5B(StMB<)T^J*4Qlp?
zpk`#<eM-99PbNO6Ob{>BmC8d@)3ahKssxD{Qmlt~qenGYn!KH4&wdD@q?C2194{V4
zb`~Nms*1$aor71O*F{u8!asp9X07m1Y`8;kUrJ3Ty3Obgmqdw*D(Xtplo2+193dIU
z&tY{SD&y!TSu6l#Ad~m4m2PUJn>iJU1z*Lr>l%qUaqV(W{v_f!<!CrS{fV-_7JXS=
zcE@cgs-8QY=_{m{2zi-=1BQ893L(GuVmA7JLU%a&OLX+pDf({0R0T{`bew!D*E7jt
z?*}q}#QS4od>>*1yQFM4$!5pc-yp~6m%;EDKPNFY#+8>0k8!!|k3GgDd6|s^k8#rQ
z7&AX6JL;_a%`}V`$+_cVn$4dBLq6V=0#m5_ZFGn8agiE#^m3UG!gTRD6(gIFB69!u
z?_@%VH+n+m%jq})v4JlQ%fjpJGJrX_3lu(m8$!}0_&BcRQsqffF>!!E<^>gDqI(h?
z_2Mcm65U1Usw;ckO_u{Raq{*<5dN7I<}2`F!V>oMbKv}d`ts&63sX&|l09UFuAnE<
zFG#nG$nCK+((S!jw|V52_8Sx31;cJ@(_g?ntPlysP2oxCq&wlIZiN43+W&rVD$&P<
zc5`ZVqWhdY)#$zyH_}xIhg>9k$U22{Ed5;>&f~B;undQ!)W8|>{*TDW1-AY~_gUCb
z_lH_3yp@EfNnv8C>->w5M*`sf`_xO?Ly4Xbh&#&N$C<{89QS!Q8G7|JgpyJg%J1RZ
zBh1mrdq{KEwogW-?;+PpBPXFha32oP?g0CU@SxX<-d9j{OMh}P#_@w8WUd`DgNG0j
z*q4$;d9#^_>)jbOz~~HFi-he>bPoaa((SnGtcW|)mD1VW<ZL<)Ny(S@@4x-<Bb9>Z
zf5<7=Pra^6vCWyDF7;ND-dm@n@2?Y+$^&uSrvxW#F9NbvF(weBmHG%U(fuCc2qA{p
z#PpOwdGqNh+zsv#;Y$A&6RP>PIGMOpGRo7x7-r;2#(K%PF8#tVBVRIVBx7#+v0+Am
zWXzF_rRn4_qcHA(kBcSa(}NMJp{)N2lK0{B6xh{+x5(<n6tBjZ2L|7VBtLJ}nTfMz
z5|5XQ&7^5hx^Z+Sr{GwUyuh3ExK3B$dV)?Fb?oR=P{(N|J(Z^&&XNJz>uxxd^ZeFz
zvcpVlGfEOM<N8Fs;YoN67gn9shJmBkD8sSdn1kawqat7?3rb-CDefGgGeKafSeZ*j
zNh<1sT<TGxdW{8U;$RXA$%`CMy@FGt+KA;!on&1zmx_CwvRIu<eVwR?IEjd1x!O$h
z!LpgW!SU3;Nz2j2F(c-9D7CoWh&mop%eKP7w-GzUuSZq&XZpJa^G;(qguG0y#cEfM
z*NJ1bG4-w@P^%4%LB%8SsgfHQp$ze&5#mY}DRX*_3LNVxmCkjPN@u_TcS^w6aN4o%
zKj1uI#7;Zb{z`HJryc9jYj_P3^~O3#^g{yiL*f-_2U_dVYd9doXy~gp>S;D`@05f(
zcWN(^+LWG%>ulHIjzfss1x2q>4-+q=S8ddx2k8cwb{5`-<vOkAzMfj_HJUh0dFfUu
zN=gugIw?VY9sB}AWI6(`#3AP_bix#*AhA|DX{a_DR8Zhn8<0ev>M=)6q33M4G`=in
zscCR6Cej4Cgr>#ul-G#ndyN>1FbqN%)7xY=%2f|7o$(xxXLWi==8~;L$<{;Kz!Ia{
zD8mss$Fbh1#BrVB$t{{}$$E3C*^(VmcDiIuRkB_#*&W7X@Sdxlt=e_D)NIu@<WhCj
z)?rxn7+!;_+I3vDJB&wQeDO2V2i5K#N#J=Y96SRy^y+l|@mjtgUsTDF+i^tE-c^Jp
z1RjDEhUC^u+zEV#u^Top1Gq^Z%)9~%2N(B)USQmPVOFUC*@Qkah}rhwSWlTn1tYUg
znd-+3TDG4;4a34?GTu`tV?0niBpqM@_!_M4F*=nPTwJXN12K4Aze^jSbs|FUKncrE
zlsipS%R7v{&`(c7Ip1@brkyI9f<u|WkUS#LCiLa80e+l?2XKv(4=E_?UD5()FfZK>
zfoiENk1#~a^0*G6H0&@s;YNnCfo$CcThcoDQ$7%_<U<jrAcgp&NfqjFZ7Y1xn^WF(
zj;D6X3C-UnGnUw8JOpP+4fe=Tb+yy7NA7!>-5JLN*P_VWDQshTu&b7F>`9s3A2{|L
zupcnHf97~NW$$<FD1Zp<bB)D+JFWPGNpwOv*Z;)XL!lgeP1S8xw^psGs=uXfq^bC-
z+QfNpa+7ma_uJd6Y7-x>N|w9daKD?L2K5|!nU`u8pN!3_O>TRaf{DH7O`IRN6;8WT
z@waM|D{$+YF_6x4zhNeh)3)drs@7HAUbVi;N85BBG3+0^sFhy28<{winQ~(yW;iLT
z9e1OES0#>^D2HgEI<cUXPFK2&xB`iSQaa57iK5cun&>3ruOw1nQ2&Hxlo9eaA>Ap*
z{a3@NnfSdoxy|S>ac#9@azuJI4KHVWaA`L)2Pi1Jxdf3pcDxTcv+H_8Z2vXsl}NqI
z)V?K!Sfb@i%q2xe>;<!Ho8c7O-*KnX&fOGrg_%j2OK{@Gj?vyr1w}<yC3#&Wgsvup
z*a<mD2uwu9_JcL<x6_x7h_5zt*t;ZPL}TA08JF1pUJd5x;_UjNHuI8qN!W13ezB_i
zgYECT-;l!_9$z3aqm-_b$m*O@y0)QXln$ZeO=L1ZH#3K77rz<*W%^mnU1iSPW%s=?
zrPx0iZ{ryl&%k&F#xpRUf$<ECXW*|U12*}q#LgS=i%Jg$O&_%N9tjjph!YWwpTJ%)
zwE+!xdn77`pij-|kpOmD{H=jtcw<{6<_pcAht+QucZoT!7JrAY8NZfYDqLluMGIV;
zgKe$B@KvI;y);nTh!4T`Nusn-Ou`RhN1770%k!7eR$VbZ;s45H_5q#)q;I~9uvVsT
ztIq+R1l;#-Cga3H`5fRJz<pR6dI2B87G0Qdz)nEgm)rwL`-@Kio&)?5U=bF!M*(*M
zmgK45Xv3_qZM|98I-Iu4izeE4gFh9t8TdpI$7~X)o`@|P^?UmTfW2>KGV~}bik&Np
zOFrwk@J@TDxMBK|tLGcDh)nwH@Nq$3F64=b=E5h2j~(NA91nM*s@QpN-tx&4@}n2P
zbyd*>r=SHqJ$UZH`ga8&$>dcROEVPHGx+p__ax#M)y18*{NuJ_dlfn==GQ@P_$@Y?
ziNCUVzb*em$)~UNX%J`dDZ|blxu-@nJ9q|YGbKgQW`I^<p;dr(J!tqIQ#R7A2CWjb
z0!^z1YyfQmXxoT}_?H!zEGu>`D=u4BTv1hASyk-mnz(C1PvQLq59HsQmlq;Zg=9kr
zqRy&fy9w1tU?&Va@*Dd<v{PLi&9#&N_qNdtQ4G3Kj1)7)Pru90>l7)X!HsOpC!F}y
z^6{Zmj<@j)jAvjx1LGMO&%k&F#xpRUfxq4i(E4z+5C3nC|EI?PJLCVE@j8?Lf5!he
zWBOw{rHs>m>Zmc^t5Csfce$7ajsGilRz9o<fgj4r#{a9kT;u#KGF{{R|2lfEpvL^)
zozFxF@2ABt8n5>~x-&tM#gOLnf51M{bpG!bJ<CwT&tFC~ttJL;RAgRXr?k{m4e5vC
z)bL9#*^X-h(|dIw9Jfm+ivO#|{_%q$IUX31qWD>c#2d7JM&ld@9xEmPI?cx~`y|eO
zbvXZTAp8GuO9YGP;RbDmhPP@M((n!q@73@d8h%ehrH${C>Qry#^5xgL%GPX*x5nZw
z_rmgd<rQ<?i{cV@U!^HR^Bcqd)=jQEf^F^1kyh6{ce%T~(p8p~X4|7dnzjP*e#Xk9
zd9;a(1^*x2r3LW2pi$+o06h<9!;ob>q(^j@{{JXY+34@?MCbEWlSM5$HZfDw={)fH
zM^A({R3o1^EKkq5WQWgxrtg&Umt_Cn%Jj$48Ew2zfiB~rfuOBWK)eb#n*3{H(8tQp
zJCL`DDWZf1ZfT?0KMffht|r!cJ12FWg8zTdrW{C=O<bNmubEy7`bEf3ouMdv{hAN@
z6uf<+vqI5n4~W`gDX;pu$Iogzo>E1X`vT~$)pnY+9k%b2^qFFVo=3h;1jn%d<uT~A
zm*mO~MEAdi3x*3jMILRRMerj&Io|zx9_3yQXtW121$EVAiRW2OFZrdiz;V6+x)b)f
z4%z-;&@aj>7JR*D`mNedrM3f9wpTRWWwC!0^h>Z?#r|mjgW7+Bev#^qI1XufCbInR
zC4HLE(~DYF?Ly>-#{2Pch4b~*fzBnkE6mYyEI$);s*?@c5bsCLm3B%5|L>cx>(!Dz
zQ=GHdUn%WR&E6;3{;itM_ZOxIr5%T8(DmccakYRxS|09|cBaWaLK0;=Uz2ubiaecX
zpt8~XA!r`0b@E-%N3&1=R>4^+c-_GBb41b|;$a;p-$!2s-KBlyf;Xi6Oi^Z;_Y*|V
zn~}ZFv(CGqj}}ix+Cf&dUmkBU0-*CZrs;g%O$B{4JD(bZejVse)QLyO$?@DE<#Am#
z6ouCz56d_wXWwfxTPt`WPkG~Y7w3)EH&llemORj3u84ltQa=sS&P-vq$gkINTF=M7
z(elwc6^Lt=f>F?^p7-nfj@<JFE3V~-v|o;Mi<aN;38gRhaM7jrzfhb`OFZ|3PUkPr
z2g7e`J1Lzvx<65SR?1HkWeUyey(H~B#3ntDNQ-R0ko1}2yd^)sly;`_@-u5^faFnE
z*8H3S-KA~jf?@<r*MSmE<oKs+I_offHab*qyng2M)D1f2xzkXF_`0!N)AwsSP}y!n
zhwR+1$LrMe+a-Ob*xjZG8$t@+t?5GNALS>TA9Ov{nxeoq&@alvs+bLH?w6$<hxiBW
zm+gpK8rx#-@(AAb;osQoi}^PR5ZYsL{Pw3&48M=W7i;k~(i=(IAsLAHHiaV_{b64q
z7HMnu`Qsg;G13wZ2V=oNdBx}w^v029pTDimzr`19jkRqNp*DX@&=-iev}}P&4(5X-
z-Z_#Z+7xMzm50L+i~8Gro5TKiYh#no@{LX)cn8TeM&o#q${o#(L0@}Qa|?#y^R2kK
zYE_M|W_2~aKBao?>Z(<>%R$CYyRXL7T(kOS;almwaaonucjJl`x75`8>Z_J{YkY!y
z!lF;TQ3bC^@y$o@!Jvm<Z$d9JVIdV8y-3C97R{~EIK4;3@(z})B3_0fU$v4YkflHN
zDws$-Mi#gDTIfwK!GJ&J7c#R%@^|P2wZ@tQzODMzE+cKr1WG;&H^%&s?hws^kre3b
zaW_VyTYNF9k=*R7H_%`nqpa7;rj(#Yw~EcJG|_J8=e|!S8`peon1bFSHg1WjBE!(s
zYismNX|nQ^qs{P6Rn#18hv865%r_5KM!Y7cHPY4sTA)4RYr<E5c#n>H$4%A_UP|Mu
zy%9bF&8@z8doUm;ZJvH}PF9TEsdws(Afbo@*$l}S?@&xMQi`ZUO^|P6d%GS4eNWjA
z+n6`h(ui*(7cC(9k+X-i>g=~f0`YKAzJ@1+vX9{zM<-0Bn@%1rOEn*jbqKu3XT%f`
ziZ=@J2|%wJ8uk4_N{Qxyr&JXZj|>CjsANrq+9EAkijFfJ@dxCDxJBEhjc}a%wxN+3
zq|;oA;-R3teM?KszY#FjreG8IS|hPw`KH!*`NnuNURu>05aoDfQLrs4B(=$pw;h!S
zwzR^6g0VKm(NBaU5FdDL!LXkcbT1l?kwwH(j&Av;2s*J~2aa^ff>1=BX63;qeUWOy
z%aF2|D9^?hZHq*MZLuwyRJE*jE?rQ|V@f}UVo;9MwN2UQKFLsIa3@FMZ)t8s<dGN!
z1qz+%<<bH!wJoUq5!d~{6rueY+FRiJ0p0JZF*FpKJ>oVpqLl-qdp7Iu)A|f){gd1H
z{?4>T=-{tf5!Ub5`V84W`{nft>wD0lJs;NReIbT*+99pqsLT3%-#0-cm#okGObkB)
zFSl9auLG0za9E%BZ5Z<SIewO7csulIPl$2e7h<?UJE!<bhU3>{5k`k>vp(-jF?8vG
z*+1*^`0vpA%d|n>=VF-Bbke6<vFdk%MlSSr)$qDQuRnPI4<fR09E^Vnom_pF)?>Jv
z1vS*{?DFPbi~fEsz>wo*GmJBQz@neh`V1@aH^<z@{+ae5I=S)lb1_5SA7y##_`j|7
zdH(o0hvCz#s3FJA<VP*~yf4C#*WWB}_5Uvx{R+2|WO!2759?d~{}7Da>&G9w|B)3n
zwCexVqR;zn3|$s|Yy7_gpF+&njHXpef!~|Rb<Sq2`o95#Vq$&Xx9(i3H2#$SAjD|z
zNME+H`)#am9WUbye+ON9mSBC}-`lDEvm8WZW1Jzqa#fCB^Ld|Qx7IJybOzS?Erv3k
zyX>F$+xWeqd|~0|U-rv(z5?A`|NNf1Uj0ihtAEyKc&R0R-j^TH`VAbohOEcXWzpw-
z4v$AEaGY$0^%$QGeWLOB`T4QVqa=oQvZR7md!N?+D@KY5*4O2j<w9JxT$nxgvr<Al
z;!#M6{zAb8l)lU0__;2L&9S4KZN0RSjJDHK@a$J{=~;Aiea~tovB08WLB;<9anD}2

literal 0
HcmV?d00001

diff --git a/runtime/cpu_device_shim.c b/runtime/cpu_device_shim.c
index 2f1cb59..053bb6d 100644
--- a/runtime/cpu_device_shim.c
+++ b/runtime/cpu_device_shim.c
@@ -16,9 +16,24 @@
  *   DEVFREE(dev)           -> pas_dev_free(dev)
  */
 
+#include <stdint.h>
 #include <stdlib.h>
 #include <string.h>
 
+/* Thread-local index registers.  The compiler emits THREADIDX_X / BLOCKIDX_X
+ * etc. as loads from these symbols (declared 'external thread_local global i32'
+ * in the device LLVM IR).  pas_dev_launch sets them before each thunk call so
+ * the kernel body sees the correct indices -- the same values a GPU provides
+ * via hardware special registers.  _Thread_local storage makes the design
+ * naturally OpenMP-parallelisable: each OS thread gets its own set. */
+/* Thread and block indices default to 0, i.e. thread 0 of block 0, until a launch sets them. */
+_Thread_local int32_t __pas_tid_x   = 0, __pas_tid_y   = 0, __pas_tid_z   = 0;
+_Thread_local int32_t __pas_ctaid_x = 0, __pas_ctaid_y = 0, __pas_ctaid_z = 0;
+/* Dimension counts default to 1: a unit grid so stride = BLOCKDIM*GRIDDIM = 1.
+ * pas_dev_launch overrides these before the first thunk call. */
+_Thread_local int32_t __pas_ntid_x  = 1, __pas_ntid_y  = 1, __pas_ntid_z  = 1;
+_Thread_local int32_t __pas_nctaid_x= 1, __pas_nctaid_y= 1, __pas_nctaid_z= 1;
+
 /* Allocate n bytes of "device" memory; returns an opaque handle the host must
  * not dereference (the dereferenceability invariant). On the CPU device the
  * handle happens to be a real heap pointer, but Pascal code only ever hands it
@@ -94,17 +109,37 @@ void *pas_dev_module_get_function(void *module, const char *name) {
     return 0;
 }
 
-/* Launch a resolved entry.  CPU device: the entry is the dispatch thunk; call
- * it once with the marshalled argument array.  Geometry is unused on the CPU
- * device (BLOCKDIM_X/GRIDDIM_X lower to 1, so a single-thread grid is correct);
- * it carries the same six values cuLaunchKernel consumes. */
+/* Launch a resolved entry.  CPU device: the entry is the dispatch thunk.
+ * We emulate the GPU by iterating over every block (gx*gy*gz) and every thread
+ * within each block (bx*by*bz), setting the thread-local index registers before
+ * each call so the kernel body sees the correct THREADIDX_x/BLOCKIDX_x values.
+ * BLOCKDIM_x/GRIDDIM_x are constant for the whole launch and are set once.
+ *
+ * Loop order matches CUDA's row-major convention: x is the fastest-varying
+ * thread index, z the slowest, mirroring the hardware warp layout. */
 typedef void (*pas_klaunch_fn)(void **);
 void pas_dev_launch(void *entry,
                     long long gx, long long gy, long long gz,
                     long long bx, long long by, long long bz,
                     void **argv) {
-    (void)gx; (void)gy; (void)gz;
-    (void)bx; (void)by; (void)bz;
-    if (entry)
-        ((pas_klaunch_fn)entry)(argv);
+    if (!entry) return;
+    pas_klaunch_fn fn = (pas_klaunch_fn)entry;
+    /* Block and grid dimensions are constant across the launch. */
+    __pas_ntid_x  = (int32_t)bx; __pas_ntid_y  = (int32_t)by; __pas_ntid_z  = (int32_t)bz;
+    __pas_nctaid_x= (int32_t)gx; __pas_nctaid_y= (int32_t)gy; __pas_nctaid_z= (int32_t)gz;
+    for (long long gz_i = 0; gz_i < gz; gz_i++)
+    for (long long gy_i = 0; gy_i < gy; gy_i++)
+    for (long long gx_i = 0; gx_i < gx; gx_i++) {
+        __pas_ctaid_x = (int32_t)gx_i;
+        __pas_ctaid_y = (int32_t)gy_i;
+        __pas_ctaid_z = (int32_t)gz_i;
+        for (long long bz_i = 0; bz_i < bz; bz_i++)
+        for (long long by_i = 0; by_i < by; by_i++)
+        for (long long bx_i = 0; bx_i < bx; bx_i++) {
+            __pas_tid_x = (int32_t)bx_i;
+            __pas_tid_y = (int32_t)by_i;
+            __pas_tid_z = (int32_t)bz_i;
+            fn(argv);
+        }
+    }
 }
diff --git a/src/pascal1981/codegen/exprs.py b/src/pascal1981/codegen/exprs.py
index fed5144..f93dec7 100644
--- a/src/pascal1981/codegen/exprs.py
+++ b/src/pascal1981/codegen/exprs.py
@@ -727,18 +727,48 @@ def _to_i16(v: ir.Value) -> ir.Value:
     # Built-in Functions
     # ========================================================================
 
+    # Mapping from Pascal builtin name to the thread-local global the CPU shim
+    # defines and pas_dev_launch sets before each kernel invocation.
+    _CPU_TLS_GLOBALS = {
+        'THREADIDX_X': '__pas_tid_x',
+        'THREADIDX_Y': '__pas_tid_y',
+        'THREADIDX_Z': '__pas_tid_z',
+        'BLOCKIDX_X':  '__pas_ctaid_x',
+        'BLOCKIDX_Y':  '__pas_ctaid_y',
+        'BLOCKIDX_Z':  '__pas_ctaid_z',
+        'BLOCKDIM_X':  '__pas_ntid_x',
+        'BLOCKDIM_Y':  '__pas_ntid_y',
+        'BLOCKDIM_Z':  '__pas_ntid_z',
+        'GRIDDIM_X':   '__pas_nctaid_x',
+        'GRIDDIM_Y':   '__pas_nctaid_y',
+        'GRIDDIM_Z':   '__pas_nctaid_z',
+    }
+
     def codegen_device_index_builtin(self, name: str) -> ir.Value:
         """Lower DEVICE thread/block index reads.
 
-        On the CPU-device stand-in, DEVICE code executes as a one-thread,
-        one-block grid.  On NVPTX, lower to the corresponding special-register
-        read intrinsic.  AMDGPU dimension plumbing is deferred; keep it
-        deterministic rather than inventing a half-wrong dispatch-ptr decode.
+        On the CPU-device stand-in, lower each builtin to a load from a
+        thread-local global variable defined and maintained by the CPU shim's
+        ``pas_dev_launch`` loop.  This lets the shim drive every thread in the
+        launch geometry and have the kernel see the correct index on each
+        invocation -- the same semantic a GPU provides via hardware registers.
+
+        On NVPTX, lower to the corresponding special-register read intrinsic.
+        AMDGPU dimension plumbing is deferred; keep it deterministic rather
+        than inventing a half-wrong dispatch-ptr decode.
         """
         upper = name.upper()
         if not _is_gpu_triple(self.device_triple):
-            value = 1 if upper.startswith(('BLOCKDIM_', 'GRIDDIM_')) else 0
-            return ir.Constant(ir.IntType(32), value)
+            # Emit a load from the thread-local global the CPU shim sets.
+            tls_name = self._CPU_TLS_GLOBALS[upper]
+            i32 = ir.IntType(32)
+            try:
+                gv = self.module.get_global(tls_name)
+            except KeyError:
+                gv = ir.GlobalVariable(self.module, i32, tls_name)
+                gv.storage_class = 'thread_local'
+                # linkage stays 'external' (default) — defined in cpu_device_shim.c
+            return self.builder.load(gv)
         if self.device_triple.startswith('nvptx'):
             nvptx_map = {
                 'THREADIDX_X': 'llvm.nvvm.read.ptx.sreg.tid.x',
diff --git a/tests/integration/test_device_mandelbrot_x86.py b/tests/integration/test_device_mandelbrot_x86.py
index 87880c9..81010c9 100644
--- a/tests/integration/test_device_mandelbrot_x86.py
+++ b/tests/integration/test_device_mandelbrot_x86.py
@@ -146,12 +146,14 @@ def test_mandelbrot_runs_on_x86_cpu_device(self):
             with open(harness_path, 'w') as f:
                 f.write(_HARNESS_C)
 
-            # 3. Link and run. No Pascal runtime needed: the kernel is
-            #    self-contained (no host I/O, no externs on the CPU-device
-            #    path).
+            # 3. Link and run. The kernel IR now references the thread-local
+            #    index globals (__pas_tid_x etc.) defined in cpu_device_shim.c,
+            #    so link that in too (no other Pascal runtime needed).
+            shim_path = os.path.join(
+                os.path.dirname(__file__), '..', '..', 'runtime', 'cpu_device_shim.c')
             exe_path = os.path.join(tmpdir, 'mandelbrot_x86')
             link = subprocess.run(
-                ['clang', ir_path, harness_path, '-o', exe_path],
+                ['clang', ir_path, harness_path, shim_path, '-o', exe_path],
                 capture_output=True, text=True)
             self.assertEqual(link.returncode, 0, msg=link.stderr)
 
diff --git a/tests/test_device_index_intrinsics.py b/tests/test_device_index_intrinsics.py
index 3a8fdb0..21270ba 100644
--- a/tests/test_device_index_intrinsics.py
+++ b/tests/test_device_index_intrinsics.py
@@ -53,11 +53,15 @@ def _compile(self, src, device_triple='x86_64-pc-linux-gnu'):
         ast = parse_source(src)
         return compile_to_llvm(ast, device_triple=device_triple)
 
-    def test_cpu_device_lowers_reads_to_one_thread_grid_constants(self):
+    def test_cpu_device_lowers_reads_to_tls_globals(self):
         ir = self._compile(DEVICE_SRC)
         self.assertNotIn('llvm.nvvm.read.ptx.sreg', ir)
-        self.assertIn('mul i32 0, 1', ir)
-        self.assertIn('add i32 %".3", 1', ir)
+        # Each builtin lowers to a load from a thread-local global so that
+        # pas_dev_launch can set the correct index before each thunk call.
+        self.assertIn('thread_local global i32', ir)
+        self.assertIn('@"__pas_tid_x"', ir)
+        self.assertIn('@"__pas_ntid_x"', ir)
+        self.assertIn('@"__pas_ctaid_x"', ir)
 
     def test_nvptx_lowers_all_reads_to_special_register_intrinsics(self):
         ir = self._compile(ALL_INDEX_READS_SRC, device_triple='nvptx64-nvidia-cuda')

From ab275ee0b28e8a3f3d8be82130b7753367d1345e Mon Sep 17 00:00:00 2001
From: Dixie Flatline a/k/a McCoy Pauley <ubuntu@localhost>
Date: Sat, 27 Jun 2026 04:48:29 +0000
Subject: [PATCH 06/10] Move completed plan docs to docs/old/

- device-build-cleanup-plan.md: fully implemented (commit 47ba728)
- CPU_DEVICE_TODO.md: CPU device now emulates full GPU launch geometry (commit 7713a86)
---
 {examples/device_ptx => docs/old}/CPU_DEVICE_TODO.md | 0
 docs/{ => old}/device-build-cleanup-plan.md          | 0
 2 files changed, 0 insertions(+), 0 deletions(-)
 rename {examples/device_ptx => docs/old}/CPU_DEVICE_TODO.md (100%)
 rename docs/{ => old}/device-build-cleanup-plan.md (100%)

diff --git a/examples/device_ptx/CPU_DEVICE_TODO.md b/docs/old/CPU_DEVICE_TODO.md
similarity index 100%
rename from examples/device_ptx/CPU_DEVICE_TODO.md
rename to docs/old/CPU_DEVICE_TODO.md
diff --git a/docs/device-build-cleanup-plan.md b/docs/old/device-build-cleanup-plan.md
similarity index 100%
rename from docs/device-build-cleanup-plan.md
rename to docs/old/device-build-cleanup-plan.md

From 84fa8c1f68aa90c449508207229b42b6a9b03345 Mon Sep 17 00:00:00 2001
From: Dixie Flatline a/k/a McCoy Pauley <ubuntu@localhost>
Date: Sat, 27 Jun 2026 04:52:33 +0000
Subject: [PATCH 07/10] Document CPU device emulation in example READMEs and
 code docs

The two device_ptx example READMEs and the CUDA prescription doc still
described DEVICE=cpu as 'not yet wired' / a single-thread grid. The CPU
shim now emulates the full launch geometry, so update both READMEs with
CPU + CUDA build/run instructions and prerequisites, fix the
device-example.mk header comment, and update the now-stale
'single-thread grid' language in cuda-kernel-prescription.md and the
codegen docstrings to match the TLS-index-register emulation.
---
 docs/cuda-kernel-prescription.md           | 24 ++++++++++++++--------
 examples/device_ptx/device-example.mk      |  6 +++++-
 examples/device_ptx/fill_indices/README.md | 22 +++++++++++++-------
 examples/device_ptx/mandelbrot/README.md   | 17 +++++++++++----
 runtime/cpu_device_shim.c                  |  4 ++--
 src/pascal1981/codegen/stmts.py            | 14 ++++++++-----
 6 files changed, 60 insertions(+), 27 deletions(-)
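
Reviewer note, not part of the commit message: a minimal Python sketch of the
launch emulation that the updated docs describe, under stated assumptions.
`emulate_launch`, `fill_indices_kernel`, and the `regs` dict are hypothetical
names used only for illustration; the real loop is `pas_dev_launch` in
runtime/cpu_device_shim.c, which stores the indices in `_Thread_local` globals.
The kernel body (one thread per element) and the 4 x 64 geometry are assumed
shapes, not copied from the example sources.

```python
from itertools import product

# Stand-in for the shim's thread-local index registers (__pas_tid_x etc.).
regs = {}

def emulate_launch(kernel, grid, block, *args):
    """Call kernel once per (block, thread) pair, with x varying fastest."""
    gx, gy, gz = grid
    bx, by, bz = block
    # Dimensions are constant for the whole launch, so set them once.
    regs.update(ntid_x=bx, ntid_y=by, ntid_z=bz,
                nctaid_x=gx, nctaid_y=gy, nctaid_z=gz)
    # product() varies its last argument fastest: z is outermost, x innermost.
    for cz, cy, cx in product(range(gz), range(gy), range(gx)):
        regs.update(ctaid_x=cx, ctaid_y=cy, ctaid_z=cz)
        for tz, ty, tx in product(range(bz), range(by), range(bx)):
            regs.update(tid_x=tx, tid_y=ty, tid_z=tz)
            kernel(*args)

def fill_indices_kernel(buf, n):
    """Assumed one-thread-per-element kernel: each call writes one slot."""
    i = regs["ctaid_x"] * regs["ntid_x"] + regs["tid_x"]
    if i < n:
        buf[i] = i

buf = [-1] * 256
emulate_launch(fill_indices_kernel, (4, 1, 1), (64, 1, 1), buf, 256)
print(buf[:8])
```

Launching the same kernel with grid (1, 1, 1) and block (1, 1, 1) writes only
element 0, which is why the earlier single-thread lowering was only sufficient
for grid-stride kernels.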

diff --git a/docs/cuda-kernel-prescription.md b/docs/cuda-kernel-prescription.md
index 7776b45..9a636e4 100644
--- a/docs/cuda-kernel-prescription.md
+++ b/docs/cuda-kernel-prescription.md
@@ -529,8 +529,13 @@ coerced to the kernel's parameter ABI — exactly what `cuLaunchKernel` consumes
 `pas_dev_launch(name, thunk, gx,gy,gz, bx,by,bz, argv)`. Geometry is supplied as 2 values
 (grid, block → a 1-D launch) or 6 (gx,gy,gz, bx,by,bz); the count is implied by the kernel's
 arity. On the CPU device `pas_dev_launch` invokes a compiler-emitted per-kernel dispatch
-thunk `__pas_klaunch_<name>(void** argv)` that unpacks `argv` and calls the kernel as a
-single-thread grid, so its grid-stride loop covers the whole buffer. The kernel-name string
+thunk `__pas_klaunch_<name>(void** argv)` that unpacks `argv` and calls the kernel.
+The CPU shim emulates the full launch: `pas_dev_launch` loops over the entire
+`gx*gy*gz x bx*by*bz` grid, setting thread-local index registers (`__pas_tid_x`
+etc., which `THREADIDX_*`/`BLOCKIDX_*` lower to on the CPU triple) before each
+call, so the kernel sees the correct indices — the same semantic a GPU provides.
+A grid-stride kernel covers the whole buffer on a single-thread grid, but the
+emulation now covers one-thread-per-element kernels too. The kernel-name string
 and the geometry ride along unused on the CPU device — they are precisely what the CUDA shim
 will consume. So running the *same* Pascal program on a GPU is now a pure runtime-library
 swap: replace the four `cpu_device_shim.c` functions with CUDA Driver API wrappers and let
@@ -589,13 +594,16 @@ currently gets away without it.
   (A3) for the no-host-symbols invariant.
 - §3 (entry points): on `device=x86` the kernel calling convention is inert/ignored - kernel
   *logic* still runs serially, so you can test kernel *correctness* on CPU before you have a GPU.
-- §4 (intrinsics): provide CPU-device lowerings - `THREADIDX_X`→0, `BLOCKDIM_X`→1,
-  `SYNCTHREADS`→no-op - so a kernel run on the CPU executes as a single-thread grid and
-  produces the right scalar answer. This lets you validate kernel math with zero GPU.
+- §4 (intrinsics): provide CPU-device lowerings - `THREADIDX_X`/`BLOCKIDX_X` lower to
+  loads from thread-local globals (`__pas_tid_x` etc.) that `pas_dev_launch` sets before each
+  kernel call; `BLOCKDIM_X`/`GRIDDIM_X` likewise. So a kernel run on the CPU executes across
+  the *full* launch geometry and produces the right answer. This lets you validate kernel
+  math with zero GPU.
 - §5 (orchestration): a CPU-device shim where `DEVALLOC`=`malloc`, copies=`memcpy`, and
-  `LAUNCH` marshals a `void**` and calls `pas_dev_launch`, which runs a per-kernel dispatch
-  thunk (single-thread grid). Same Pascal program, no GPU. Then swap the shim for the CUDA one
-  — the launch call site is already GPU-shaped, so only the runtime library changes.
+  `LAUNCH` marshals a `void**` and calls `pas_dev_launch`, which loops over the full grid,
+  setting thread-local index registers before each per-kernel dispatch-thunk call. Same
+  Pascal program, no GPU. Then swap the shim for the CUDA one — the launch call site is
+  already GPU-shaped, so only the runtime library changes.
 
 This is the CPU-device dividend the design designed for; lean on it.
 
diff --git a/examples/device_ptx/device-example.mk b/examples/device_ptx/device-example.mk
index 6d071b6..025224e 100644
--- a/examples/device_ptx/device-example.mk
+++ b/examples/device_ptx/device-example.mk
@@ -6,13 +6,17 @@
 # DEVICE variable. The host Pascal is identical for both devices; only the build
 # differs -- which is the whole point of the shim design.
 #
-#   make                 # DEVICE=cpu  (CPU stand-in -- see CPU_DEVICE_TODO.md)
+#   make                 # DEVICE=cpu  (CPU stand-in -- emulates the full grid)
 #   make DEVICE=cuda      # real GPU via the CUDA Driver API shim + embedded PTX
 #   make run [DEVICE=...] # build, then run
 #   make clean
 #
 # The including Makefile sets: DEVICE_UNIT, HOST_SRC, EXE, FEATURES
 # (and may override SM or CUDA_HOME).
+#
+# On the CPU device the shim emulates a full GPU launch: pas_dev_launch loops
+# over the whole gx*gy*gz x bx*by*bz geometry, setting thread-local index
+# registers (__pas_tid_x etc.) before each kernel call. See runtime/cpu_device_shim.c.
 
 DEVICE ?= cpu
 
diff --git a/examples/device_ptx/fill_indices/README.md b/examples/device_ptx/fill_indices/README.md
index e4dca92..236d9d9 100644
--- a/examples/device_ptx/fill_indices/README.md
+++ b/examples/device_ptx/fill_indices/README.md
@@ -30,19 +30,27 @@ has an NVPTX backend.
 
 ```bash
 cd examples/device_ptx/fill_indices
-make DEVICE=cuda run     # build the host + device, run on the GPU
+make DEVICE=cpu run      # no GPU needed
+make DEVICE=cuda run     # real GPU
 ```
 
 `DEVICE` selects the device-orchestration runtime shim at build time:
 
 - `DEVICE=cuda` — the real GPU path (CUDA Driver API shim + embedded PTX). Needs
   the CUDA toolkit headers, `-lcuda`, and an NVIDIA device.
-- `DEVICE=cpu` (the default) — the CPU-device stand-in, **not yet wired for this
-  example**; see [`../CPU_DEVICE_TODO.md`](../CPU_DEVICE_TODO.md). The host
-  orchestration already works on the CPU shim; what it needs is a grid-stride
-  kernel, which is a deferred kernel change.
-
-A correct GPU run prints the first eight buffer elements (`0 1 2 3 4 5 6 7`) and
+- `DEVICE=cpu` — the CPU-device stand-in. No GPU or CUDA toolkit required. The
+  CPU shim emulates a full GPU launch: `pas_dev_launch` loops over the complete
+  launch geometry (`gx×gy×gz` blocks × `bx×by×bz` threads), setting thread-local
+  index registers (`__pas_tid_x` etc.) before each kernel call so the kernel sees
+  the correct `THREADIDX_*`/`BLOCKIDX_*` values. The output is identical to
+  that of the CUDA path.
+
+Prerequisites by device:
+- **cpu**: Python + llvmlite, clang, `make -C runtime` (cpu archive, built by default).
+- **cuda**: all of the above plus CUDA toolkit headers, `-lcuda`, and an NVIDIA device;
+  `make -C runtime cuda` for the cuda archive.
+
+A correct run prints the first eight buffer elements (`0 1 2 3 4 5 6 7`) and
 `OK: all 256 indices correct`. The build rules live in
 [`../device-example.mk`](../device-example.mk).
 
diff --git a/examples/device_ptx/mandelbrot/README.md b/examples/device_ptx/mandelbrot/README.md
index e7b3f54..ec15d88 100644
--- a/examples/device_ptx/mandelbrot/README.md
+++ b/examples/device_ptx/mandelbrot/README.md
@@ -32,7 +32,8 @@ the companion *mandelbrot-gpu* repository.
 
 ```bash
 cd examples/device_ptx/mandelbrot
-make DEVICE=cuda run     # build the host + device, run on the GPU
+make DEVICE=cpu run      # no GPU needed
+make DEVICE=cuda run     # real GPU
 ```
 
 `DEVICE` selects the device-orchestration runtime shim at build time:
@@ -40,9 +41,17 @@ make DEVICE=cuda run     # build the host + device, run on the GPU
 - `DEVICE=cuda` — the real GPU path (CUDA Driver API shim + embedded PTX). Needs
   the CUDA toolkit headers, `-lcuda`, and an NVIDIA device. `SM` defaults to
   `sm_86` to mirror `mandelbrot.cu`.
-- `DEVICE=cpu` (the default) — the CPU-device stand-in, **not yet wired for this
-  example**; see [`../CPU_DEVICE_TODO.md`](../CPU_DEVICE_TODO.md) (it needs a
-  grid-stride kernel, a deferred kernel change).
+- `DEVICE=cpu` — the CPU-device stand-in. No GPU or CUDA toolkit required. The
+  CPU shim emulates a full GPU launch: `pas_dev_launch` loops over the complete
+  launch geometry (`gx×gy×gz` blocks × `bx×by×bz` threads), setting thread-local
+  index registers (`__pas_tid_x` etc.) before each kernel call so the kernel sees
+  the correct `THREADIDX_*`/`BLOCKIDX_*` values. The output is identical to
+  that of the CUDA path.
+
+Prerequisites by device:
+- **cpu**: Python + llvmlite, clang, `make -C runtime` (cpu archive, built by default).
+- **cuda**: all of the above plus CUDA toolkit headers, `-lcuda`, and an NVIDIA device;
+  `make -C runtime cuda` for the cuda archive.
 
 The host orchestration is compiler-generated from the Pascal source; only the
 leaf runtime shim is C. The kernels are unchanged, so the emitted PTX remains the
diff --git a/runtime/cpu_device_shim.c b/runtime/cpu_device_shim.c
index 053bb6d..f2b888f 100644
--- a/runtime/cpu_device_shim.c
+++ b/runtime/cpu_device_shim.c
@@ -71,8 +71,8 @@ void pas_dev_free(void *dev_ptr) {
  *     pas_dev_launch(entry, gx,gy,gz, bx,by,bz, argv);  // cuLaunchKernel
  *
  * On the CPU device the "module" is the registry, get_function is a by-name
- * lookup returning the thunk, and launch invokes the thunk as a single-thread
- * grid (so a grid-stride kernel still covers the whole buffer).  Swapping this
+ * lookup returning the thunk, and launch drives it across the full launch
+ * geometry (see pas_dev_launch).  Swapping this
  * file for CUDA Driver API wrappers turns the *same* compiler output into a real
  * GPU launch with no Pascal-side change: load takes the embedded PTX blob,
  * get_function returns a CUfunction, launch is cuLaunchKernel.  (A CUDA shim
diff --git a/src/pascal1981/codegen/stmts.py b/src/pascal1981/codegen/stmts.py
index 0cb3da3..fbb05f6 100644
--- a/src/pascal1981/codegen/stmts.py
+++ b/src/pascal1981/codegen/stmts.py
@@ -378,9 +378,11 @@ def _kernel_launch_thunk(self, fn: ir.Function) -> ir.Function:
 
         The thunk ``void __pas_klaunch_<name>(i8** argv)`` unpacks ``argv`` into
         ``fn``'s parameter types and calls ``fn``.  This is the CPU-device launch
-        dispatch: ``pas_dev_launch`` invokes it as a single-thread grid, so a
-        grid-stride kernel still covers the whole buffer.  On a GPU the shim
-        dispatches the kernel by name out of the loaded module and the thunk is
+        dispatch: the CPU shim's ``pas_dev_launch`` loops over the full launch
+        geometry, setting thread-local index registers (``__pas_tid_x`` etc.)
+        before each thunk call, so the kernel sees the correct indices.  On a
+        GPU the shim dispatches the kernel by name out of the loaded module and
+        the thunk is
         never called -- but it is harmless to emit, and LAUNCH only ever appears
         in host code (never a device compiland), so the thunk never collides with
         a ``ptx_kernel`` calling convention.
@@ -420,8 +422,10 @@ def _codegen_device_orchestration(self, name: str, args: list) -> None:
         launch ABI: it marshals the kernel arguments into a ``void**`` array (the
         shape ``cuLaunchKernel`` consumes) and calls ``pas_dev_launch`` with the
         kernel-name string, a per-kernel dispatch thunk, the six geometry values,
-        and that array.  On the CPU device ``pas_dev_launch`` runs the thunk
-        (single-thread grid); swapping the shim for the CUDA driver path reuses
+        and that array.  On the CPU device ``pas_dev_launch`` loops over the
+        full launch geometry, setting thread-local index registers before each
+        thunk call (so the kernel sees the correct indices); swapping the shim
+        for the CUDA driver path reuses
         this exact call site -- it dispatches by name and ignores the thunk -- so
         no codegen change is needed to run the same program on a GPU (§5.2/§5.4).
 

From 1d89492bdb63546eade14c7460b600dcd8f42e65 Mon Sep 17 00:00:00 2001
From: Dixie Flatline a/k/a McCoy Pauley <ubuntu@localhost>
Date: Sat, 27 Jun 2026 05:15:04 +0000
Subject: [PATCH 08/10] Stop GPU test from destroying the shared runtime
 archive; tighten @requires_gpu

The GPU orchestration test's _build_cuda_runtime ran 'make -C runtime clean'
against the shared source-tree runtime/build/ (which every other link test
links as libpascalrt.a), then rebuilt only the cuda archive. If the cuda
build failed -- e.g. a driver-only box (nvidia-smi + libcuda.so.1 but no
CUDA toolkit headers) -- setUpClass raised, tearDownClass never ran, and
runtime/build/ was left empty, cascading link failures into every other
exe-requiring test (the trailing gcc-install-dir-libstdcxx warning in each
truncated summary hid the real 'no such file: libpascalrt.a' cause).

Two fixes:

1. Build the CUDA shim into an ISOLATED temp copy of the runtime sources so
   the shared runtime/build/ is never touched. A build failure raises
   unittest.SkipTest (clean skip); setUpClass removes the temp dir before
   re-raising, so the worst case is a leaked /tmp dir, never a broken source
   tree. On success, tearDownClass just removes the temp dir.

2. Tighten _probe_gpu to also require cuda.h (probed the way the Makefile
   looks for it, -I$CUDA_HOME/include), so @requires_gpu is False on a
   driver-only box and the test skips at collection rather than being
   selected and then failing the shim build.

Verified: header probe returns False for an empty CUDA_HOME, True when
cuda.h is planted; full suite 848 passed, 1 skipped.
---
 .../test_device_orchestration_gpu.py          | 52 +++++++++++++++----
 tests/support.py                              | 50 ++++++++++++++++--
 2 files changed, 86 insertions(+), 16 deletions(-)

diff --git a/tests/integration/test_device_orchestration_gpu.py b/tests/integration/test_device_orchestration_gpu.py
index ccbab7f..c775553 100644
--- a/tests/integration/test_device_orchestration_gpu.py
+++ b/tests/integration/test_device_orchestration_gpu.py
@@ -91,15 +91,32 @@
 """
 
 
-def _build_cuda_runtime() -> str:
-    """Build (once) the runtime archive with the CUDA shim; return its path."""
-    out = os.path.join(RUNTIME_DIR, "build", "libpascalrt.a")
-    subprocess.run(["make", "-C", RUNTIME_DIR, "clean"],
-                   capture_output=True, check=True)
-    r = subprocess.run(["make", "-C", RUNTIME_DIR, "DEVICE_SHIM=cuda"],
+def _build_cuda_runtime(tmpdir: str) -> str:
+    """Build the CUDA-shim runtime archive into an ISOLATED temp dir.
+
+    Building in a *copy* of the runtime sources keeps the shared source-tree
+    ``runtime/build/`` (which every other link test links against as
+    ``libpascalrt.a``) completely untouched -- so this test can neither delete
+    nor repoint it, and a build failure leaves only a leaked /tmp dir, never a
+    broken source tree.  Raises ``unittest.SkipTest`` (so a GPU box with a
+    broken/incomplete CUDA toolkit skips cleanly instead of erroring and
+    cascading failures into every other link test) if the build fails.
+    """
+    # Copy the runtime sources (skip any pre-existing build/ dir) so the
+    # Makefile can build self-contained inside the temp dir.
+    for name in os.listdir(RUNTIME_DIR):
+        if name == 'build':
+            continue
+        src = os.path.join(RUNTIME_DIR, name)
+        if os.path.isfile(src):
+            shutil.copy(src, os.path.join(tmpdir, name))
+    r = subprocess.run(["make", "-C", tmpdir, "DEVICE_SHIM=cuda"],
                        capture_output=True, text=True)
     if r.returncode != 0:
-        raise RuntimeError(f"CUDA runtime build failed: {r.stderr}")
+        raise unittest.SkipTest(f"CUDA runtime build failed: {r.stderr}")
+    out = os.path.join(tmpdir, "build", "libpascalrt.a")
+    if not os.path.exists(out):
+        raise unittest.SkipTest("CUDA runtime build failed: no archive produced")
     return out
 
 
@@ -108,13 +125,26 @@ class TestDeviceOrchestrationVectorAddGPU(unittest.TestCase):
 
     @classmethod
     def setUpClass(cls):
-        cls.runtime_lib = _build_cuda_runtime()
+        # Build into an isolated temp dir; never touch the shared
+        # runtime/build/ that every other link test depends on.
+        cls._runtime_tmp = tempfile.mkdtemp(prefix='pascalrt-cuda-')
+        try:
+            cls.runtime_lib = _build_cuda_runtime(cls._runtime_tmp)
+        except BaseException:
+            # setUpClass failure (incl. SkipTest) skips tearDownClass, so clean
+            # the temp dir here rather than leak it.
+            shutil.rmtree(cls._runtime_tmp, ignore_errors=True)
+            cls._runtime_tmp = None
+            raise
 
     @classmethod
     def tearDownClass(cls):
-        # Restore the default (CPU) shim so other suites/tools see the usual lib.
-        subprocess.run(["make", "-C", RUNTIME_DIR, "clean"], capture_output=True)
-        subprocess.run(["make", "-C", RUNTIME_DIR], capture_output=True)
+        # The only shared state we created is our private temp dir; the source
+        # tree's runtime/build/ was never touched, so there is nothing to
+        # restore.
+        tmp = getattr(cls, '_runtime_tmp', None)
+        if tmp:
+            shutil.rmtree(tmp, ignore_errors=True)
 
     def test_vector_add_runs_on_gpu(self):
         files = {
diff --git a/tests/support.py b/tests/support.py
index 46bf5d6..6296072 100644
--- a/tests/support.py
+++ b/tests/support.py
@@ -25,12 +25,41 @@
 CAN_BUILD_EXE = HAS_LLVMLITE and HAS_CLANG
 
 
+def _probe_cuda_headers() -> bool:
+    """True iff ``<cuda.h>`` is findable by clang the way the runtime build looks.
+
+    ``runtime/cuda_launch.c`` does ``#include <cuda.h>`` and the runtime
+    Makefile compiles it with ``-I$(CUDA_HOME)/include`` (plus clang's default
+    system search paths). Probe exactly that: a syntax-only compile of a
+    one-liner ``#include <cuda.h>`` with ``-I$CUDA_HOME/include``. This returns
+    False on a box that has the NVIDIA driver (``nvidia-smi`` / ``libcuda.so.1``)
+    but not the CUDA toolkit headers, so the build+run GPU test is skipped at
+    collection rather than selected and then failing the shim compile.
+    """
+    if not HAS_CLANG:
+        return False
+    cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
+    try:
+        r = subprocess.run(
+            ["clang", "-x", "c", "-fsyntax-only", "-Wno-unknown-pragmas",
+             "-I", os.path.join(cuda_home, "include"), "-"],
+            input="#include <cuda.h>\n",
+            capture_output=True, text=True, timeout=10)
+        return r.returncode == 0
+    except Exception:
+        return False
+
+
 def _probe_gpu() -> bool:
     """True iff a real CUDA GPU run is possible here.
 
     Requires: an NVIDIA device visible to the driver, the NVPTX backend in this
-    llvmlite (to emit PTX), clang, and a linkable libcuda.  Probed cheaply so the
-    @requires_gpu tests skip cleanly on CPU-only machines.
+    llvmlite (to emit PTX), clang, a linkable libcuda, AND the CUDA toolkit
+    headers (``cuda.h``) to build the CUDA shim.  The last check is what skips
+    a driver-only box (``nvidia-smi`` + ``libcuda.so.1`` but no toolkit): the
+    test builds+runs the shim, so the headers are a hard prerequisite, not just
+    the driver.  Probed cheaply so the @requires_gpu tests skip cleanly on
+    CPU-only and driver-only machines.
     """
     if not CAN_BUILD_EXE:
         return False
@@ -48,9 +77,20 @@ def _probe_gpu() -> bool:
     if any(Path(p).exists() for p in (
             "/usr/lib/x86_64-linux-gnu/libcuda.so",
             "/usr/lib/x86_64-linux-gnu/libcuda.so.1")):
-        return True
-    cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
-    return Path(cuda_home, "lib64", "stubs", "libcuda.so").exists()
+        has_libcuda = True
+    else:
+        cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
+        has_libcuda = Path(cuda_home, "lib64", "stubs", "libcuda.so").exists()
+    if not has_libcuda:
+        return False
+    # The CUDA shim (runtime/cuda_launch.c) #includes <cuda.h>, built with
+    # -I$(CUDA_HOME)/include. A box can have the *driver* (nvidia-smi +
+    # libcuda.so.1) but not the *toolkit* headers, which is enough to run an
+    # already-built shim but NOT to build it -- and this is a build+run test.
+    # Probe the header exactly the way the Makefile looks for it so @requires_gpu
+    # is false on a driver-only box (the test skips at collection) instead of
+    # being selected and then failing the shim build.
+    return _probe_cuda_headers()
 
 
 HAS_GPU = _probe_gpu()

From 44f33cca5bc95e76c1d529dddd847ef1fb3ce512 Mon Sep 17 00:00:00 2001
From: Dixie Flatline a/k/a McCoy Pauley <ubuntu@localhost>
Date: Sat, 27 Jun 2026 05:25:40 +0000
Subject: [PATCH 09/10] Migrate GPU orchestration test to the --device-backend
 cuda path

The CPU TLS work (commit 7713a86) regressed this test. The test compiled the
device unit to an x86 dev.ll (for the legacy launch-thunk host path), and
that dev.ll now references __pas_tid_x etc., TLS globals defined only in
cpu_device_shim.c and not in the CUDA shim, so the GPU link failed with
'undefined reference to __pas_tid_x'.

Migrate the test off the legacy --embed-device-ptx path onto the decoupled
cuda backend: compile the host with device_backend='cuda' (emits no launch
thunk and no kernel-symbol reference, so no dev.ll is linked), objectify the
PTX into a NUL-terminated __pas_device_ptx blob, and link host.ll, blob.o,
the cuda shim, and -lcuda. This is the same 3-command flow the cleanup work
established for the examples.

Verified (no GPU needed): host .ll has external __pas_device_ptx, no
__pas_klaunch, no kernel def; ld -r host.o + blob.o links clean -- no
undefined TLS symbol. The actual CUDA shim build + run remain @requires_gpu.
Full suite 848 passed, 1 skipped.
---
 .../test_device_orchestration_gpu.py          | 51 ++++++++++++-------
 1 file changed, 33 insertions(+), 18 deletions(-)

diff --git a/tests/integration/test_device_orchestration_gpu.py b/tests/integration/test_device_orchestration_gpu.py
index c775553..b974568 100644
--- a/tests/integration/test_device_orchestration_gpu.py
+++ b/tests/integration/test_device_orchestration_gpu.py
@@ -10,9 +10,12 @@
 
 Gated by ``@requires_gpu`` so it skips cleanly on CPU-only machines.
 
-The device kernel is compiled to PTX (NVPTX backend) and embedded into the host
-compiland via ``--embed-device-ptx``; the host links the CUDA shim archive plus
-``-lcuda``.  Asserts the result is ``0 3 6 … 21``.
+The device kernel is compiled to PTX (NVPTX backend) and packaged as a
+NUL-terminated ``__pas_device_ptx`` data object that the host (compiled with
+``--device-backend cuda``) references as an external symbol; the host links
+that blob, the CUDA shim archive, and ``-lcuda``.  The host emits no
+in-process launch thunk and no kernel-symbol reference, so no device-unit
+``.ll`` is linked.  Asserts the result is ``0 3 6 … 21``.
 """
 
 import os
@@ -23,7 +26,7 @@
 
 from pascal1981.compile_to_ptx import compile_file_to_ptx
 from pascal1981.features import resolve_features
-from tests.support import (RUNTIME_DIR, compile_pascal_file, requires_gpu,
+from tests.support import (RUNTIME_DIR, requires_gpu,
                            temporary_pascal_project)
 
 _WIDE = resolve_features(overrides=['wide-integers'])
@@ -154,7 +157,6 @@ def test_vector_add_runs_on_gpu(self):
         }
         cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
         with temporary_pascal_project(files) as proj:
-            inc = os.path.join(proj, 'vadd.inc')
             dev = os.path.join(proj, 'vadd.pas')
             main = os.path.join(proj, 'main.pas')
 
@@ -165,15 +167,13 @@ def test_vector_add_runs_on_gpu(self):
             with open(ptx_path, 'w') as f:
                 f.write(ptx)
 
-            # 2/3. device unit + interface -> host x86 .ll (the device .ll
-            # defines the kernel symbol the host launch thunk references; the
-            # real kernel comes from the embedded PTX at run time).
-            dev_ll = compile_pascal_file(dev, os.path.join(proj, 'vadd.ll'),
-                                         features=_WIDE)
-            compile_pascal_file(inc, os.path.join(proj, 'vadd-iface.ll'),
-                                features=_WIDE)
-
-            # 4. host program -> .ll, embedding the PTX.
+            # 2. host program -> .ll with the cuda device backend.  This emits
+            # no in-process launch thunk and no kernel-symbol reference, so the
+            # host .ll needs no device-unit .ll to link against -- the real
+            # kernel comes from the PTX loaded at run time.  The PTX text is
+            # packaged as its own NUL-terminated __pas_device_ptx data object
+            # that the host references as an external symbol (the blob the CUDA
+            # shim reads as a C string and passes to cuModuleLoadData).
             from pascal1981.codegen import compile_to_llvm
             from pascal1981.parser import parse_file
             from pascal1981.type_checker import PascalTypeChecker
@@ -183,12 +183,27 @@ def test_vector_add_runs_on_gpu(self):
             main_ll = os.path.join(proj, 'main.ll')
             with open(main_ll, 'w') as f:
                 f.write(compile_to_llvm(ast, source_file=main, features=_WIDE,
-                                        embed_device_ptx_text=ptx))
-
-            # 5. link host + device .ll + CUDA shim + -lcuda.
+                                        device_backend='cuda'))
+
+            # 3. objectify the PTX text into a __pas_device_ptx data blob
+            # (PTX *text* + trailing NUL; NOT ptxas/cubin output).  incbin uses
+            # the absolute ptx_path so it resolves regardless of assembler CWD.
+            blob_s = os.path.join(proj, 'dev_ptx_blob.s')
+            with open(blob_s, 'w') as f:
+                f.write('\t.section .rodata\n'
+                        '\t.globl __pas_device_ptx\n'
+                        '__pas_device_ptx:\n'
+                        f'\t.incbin "{ptx_path}"\n'
+                        '\t.byte 0\n')
+            blob_o = os.path.join(proj, 'dev_ptx_blob.o')
+            asm = subprocess.run(['clang', '-c', blob_s, '-o', blob_o],
+                                 capture_output=True, text=True)
+            self.assertEqual(asm.returncode, 0, msg=asm.stderr)
+
+            # 4. link host .ll + PTX blob + CUDA shim + -lcuda.
             exe = os.path.join(proj, 'vadd-gpu')
             link = subprocess.run(
-                ['clang', main_ll, dev_ll, self.runtime_lib,
+                ['clang', main_ll, blob_o, self.runtime_lib,
                  '-L' + os.path.join(cuda_home, 'lib64', 'stubs'), '-lcuda',
                  '-o', exe],
                 capture_output=True, text=True)

From bddfc95670b16e5f4d39c445d306cbe16b014b5d Mon Sep 17 00:00:00 2001
From: Dixie Flatline a/k/a McCoy Pauley <ubuntu@localhost>
Date: Sat, 27 Jun 2026 05:29:36 +0000
Subject: [PATCH 10/10] README: state the 'make -C runtime' prerequisite for
 tests

The testing section omitted that link-requiring tests link the hardcoded
runtime/build/libpascalrt.a and FAIL (not skip) without it. State the
prerequisite up front so a clean-tree run isn't a surprise.
---
 README.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/README.md b/README.md
index c4bb80d..1f86a77 100644
--- a/README.md
+++ b/README.md
@@ -472,10 +472,17 @@ One unified test suite built on `pytest`, with automatic detection of optional d
 ### Run the entire test suite
 
 ```bash
+# Build the C runtime archive once (link-requiring tests link against it).
+make -C runtime
 # All tests from a source checkout; codegen tests auto-skip if llvmlite/clang are unavailable
 PYTHONPATH=src python3 -m pytest tests/ -q
 ```
 
+`make -C runtime` is **required** for the integration/link tests: they link
+`runtime/build/libpascalrt.a` (hardcoded in `tests/support.py`), and without
+that archive they *fail* (they do not skip). Parser/typecheck tests need
+no dependencies at all; codegen IR-only tests need `llvmlite` but not the archive.
+
 If you installed the package into the active environment, `PYTHONPATH=src` is not
 needed.