Migrate 15 gpu_baseline.cu files to unified compute_only interface #5

Open
b010001y wants to merge 1 commit into Echoscd:main from b010001y:fix/migrate-gpu-baselines-to-compute-only

Problem

Commit a453c5b ("Migrate all 44 tasks to unified compute_only interface") migrated task_io.cu and cpu_reference.c for every task to a single solution_compute(N, all_inputs..., output) function, but the 15 gpu_baseline.cu files were not migrated. They still expose the old two-phase interface:

// OLD (broken under unified compute_only):
void solution_init(int N, /* all inputs */ ...);
void solution_compute(int N, T* output);

Result: each baseline links successfully (the LLM-facing names match), but at runtime solution_init is never invoked, so device pointers stay null when solution_compute launches its kernel and the outputs are garbage. Caught end-to-end on tasks/black_scholes: --validate reported 0.000 ms and all-zero output despite a clean compile.

Fix

For each of the 15 gpu_baseline.cu files, apply the same mechanical migration:

  • Delete solution_init entirely.
  • Define solution_free before solution_compute so the lazy-init block can call it.
  • solution_compute takes the full set of inputs (signature copied verbatim from each task's task_io.cu).
  • Move all cudaMalloc into solution_compute, guarded by a shape check so allocation only happens on the first call (or when shape changes between calls).
  • H2D copies move into solution_compute and run every call, per the unified contract: "the harness passes full host-side inputs every call".
  • cudaDeviceSynchronize() added before return (the contract requires sync before solution_compute returns).

No kernels were touched. Only the host-side init/compute/free section at the bottom of each file changed.
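
The steps above can be sketched for a single hypothetical task as follows. This is a minimal illustration, not code from any of the 15 files: `scale_kernel`, the one-input signature, the `extern "C"` linkage, and the assumption that `out` is a host buffer are all invented for the example. The real signatures come from each task's task_io.cu.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical kernel standing in for a task's real (untouched) kernel.
__global__ void scale_kernel(int n, const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Device state cached across calls.
static float* d_in = nullptr;
static float* d_out = nullptr;
static int d_capacity = 0;  // N the buffers were sized for; 0 means none

// Defined before solution_compute so the lazy-init block can call it.
// Null checks make repeated calls harmless.
extern "C" void solution_free(void) {
    if (d_in)  { cudaFree(d_in);  d_in = nullptr; }
    if (d_out) { cudaFree(d_out); d_out = nullptr; }
    d_capacity = 0;
}

// ASSUMPTION: `in` and `out` are host pointers supplied by the harness.
extern "C" void solution_compute(int N, const float* in, float* out) {
    if (N <= 0) return;
    size_t bytes = (size_t)N * sizeof(float);

    // Lazy init: allocate on the first call or when the shape changes.
    if (N != d_capacity) {
        solution_free();
        if (cudaMalloc(&d_in, bytes) != cudaSuccess ||
            cudaMalloc(&d_out, bytes) != cudaSuccess) {
            solution_free();  // leave a clean state on allocation failure
            return;
        }
        d_capacity = N;
    }

    // H2D copy runs every call: the harness passes full host inputs each time.
    cudaMemcpy(d_in, in, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (N + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(N, d_in, d_out);

    cudaMemcpy(out, d_out, bytes, cudaMemcpyDeviceToHost);
    // The blocking copy above already waits for the kernel; the explicit
    // sync is kept because the contract asks for it before returning.
    cudaDeviceSynchronize();
}
```

Calling `solution_compute` twice with different `N` exercises the reallocation path, and calling `solution_free` twice in a row exercises the null-pointer guards.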

Verification

Ran framework/harness_gpu.cu + tasks/<t>/task_io.cu + tasks/<t>/gpu_baseline.cu against the small HuggingFace dataset (Cosmoscd/AccelEval) on sm_89 / NVIDIA L20X. Output diffed against expected_output.txt using each task's correctness.atol / rtol from task.json.

13 of 15 files compile, run, and pass correctness:

| Task | Status | GPU(ms) | CPU(ms) | max_abs_diff |
| --- | --- | ---: | ---: | ---: |
| black_scholes | PASS | 0.375 | 7.259 | 1.0e-05 |
| dbscan | PASS | 203.010 | 239.577 | 0 |
| dtw_distance | PASS | 0.316 | 1806.643 | 1.0e-04 |
| monte_carlo | PASS | 0.363 | 1332.731 | 8.2e-03 |
| hausdorff_distance | PASS | 0.102 | 1.359 | 1.0e-05 |
| euclidean_distance_matrix | PASS | 0.054 | 0.377 | 2.0e-06 |
| held_karp_tsp | PASS | 0.601 | 22.403 | 0 |
| spmv_csr | PASS | 0.107 | 0.033 | 1.2e-06 |
| max_flow_push_relabel | PASS | 1948.765 | 88.825 | 0 |
| sph_cell_index | PASS | 0.409 | 6.978 | 0 |
| sph_position | PASS | 0.351 | 0.430 | 0 |
| sph_forces | PASS | 0.304 | 4.562 | 2.0e-04 |
| pdlp | PASS | 7.128 | 2.352 | 1.0e-06 |
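
For readers unfamiliar with atol / rtol checks, the PASS status and max_abs_diff column could be computed roughly as below. This is a sketch only: `count_mismatches` is a hypothetical helper, and the rule for combining the two tolerances (`|got - expected| <= atol + rtol * |expected|`, as NumPy's `allclose` does) is an assumption, since the harness's exact formula is not stated here.

```cuda
#include <cmath>
#include <cstddef>

// Counts elements of `got` outside tolerance of `expected` and reports the
// largest absolute difference seen (the max_abs_diff column).
// ASSUMPTION: tolerances combine as atol + rtol * |expected|.
size_t count_mismatches(const float* got, const float* expected, size_t n,
                        double atol, double rtol, double* max_abs_diff) {
    size_t bad = 0;
    double worst = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double diff = std::fabs((double)got[i] - (double)expected[i]);
        double limit = atol + rtol * std::fabs((double)expected[i]);
        if (diff > worst) worst = diff;
        if (!(diff <= limit)) ++bad;  // written this way so NaN counts as a mismatch
    }
    if (max_abs_diff) *max_abs_diff = worst;
    return bad;
}
```

Under this reading, a task passes when the returned count is zero.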

Pre-existing issue (not caused by this PR)

bonds_pricing/gpu_baseline.cu and repo_pricing/gpu_baseline.cu fail to compile with errors like:

identifier "bondAccruedAmountGpu" is undefined
identifier "cashFlowsNpvGpu" is undefined
...

These __device__ helpers are defined further down in the same file but not forward-declared. The same errors occur on main before any of my changes — verified by git stash && nvcc ... && git stash pop. So this is pre-existing and unrelated to the migration. I included these two files in the PR anyway because the API change is correct and necessary; the forward-decl bug is independent and should be fixed separately.
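
The class of error described above, and one way to fix it, can be illustrated with a small standalone example. The helper name `discounted_value` and the kernel `npv_kernel` are hypothetical and are not the functions from bonds_pricing or repo_pricing; the point is only that a prototype placed above the first use resolves the "identifier is undefined" error.

```cuda
#include <cuda_runtime.h>

// Forward declaration: without this prototype, nvcc reports
// `identifier "discounted_value" is undefined` at the call site below,
// because the definition appears later in the file.
__device__ float discounted_value(float cash_flow, float rate, float years);

// Hypothetical kernel that uses the helper before its definition.
__global__ void npv_kernel(int n, const float* cash_flows, float rate, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = discounted_value(cash_flows[i], rate, (float)(i + 1));
}

// Definition further down in the same file.
__device__ float discounted_value(float cash_flow, float rate, float years) {
    return cash_flow / powf(1.0f + rate, years);
}
```

An alternative with the same effect is to move the helper definitions above their first caller.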

Test plan

  • All 15 files compiled with nvcc -O2 -arch=sm_89: 13 build cleanly, 2 fail with the pre-existing forward-decl bug above.
  • All 13 buildable baselines validate against expected_output.txt on the small HF dataset within their task.json tolerances.
  • solution_free is callable repeatedly (idempotent — guarded by null pointer checks).
  • Repeat on medium / large (not run here; should work given the migration is shape-agnostic).

Reproducer

python3 scripts/download_data.py small
/usr/local/cuda-12.4/bin/nvcc -O2 -arch=sm_89 -I framework/ \
    -diag-suppress=177 -diag-suppress=550 \
    framework/harness_gpu.cu \
    tasks/black_scholes/task_io.cu \
    tasks/black_scholes/gpu_baseline.cu \
    -o /tmp/bs_gpu
/tmp/bs_gpu tasks/black_scholes/data/small --validate
# main (broken): TIME_MS: 0.000, output.txt all zeros
# this PR:       TIME_MS: 0.375, 100000/100000 within atol=0.01

The repo migrated all task_io.cu adapters and cpu_reference.c files to a
single solution_compute(N, all_inputs..., output) function in commit
a453c5b ("Migrate all 44 tasks to unified compute_only interface").
However, the 15 gpu_baseline.cu files were not migrated and still expose
the old solution_init + solution_compute(N, output_only) split. As a
result they link successfully (the LLM-facing names match) but produce
garbage at runtime: solution_init is never called, so device pointers
stay null when solution_compute launches its kernel.

This commit migrates each gpu_baseline.cu to the new interface following
the same pattern:
- Delete solution_init entirely.
- Define solution_free before solution_compute (so the lazy-init block
  can call it).
- solution_compute now takes the full set of inputs (signature copied
  verbatim from the task's task_io.cu).
- Move all cudaMalloc into solution_compute, guarded by a shape check
  so allocation only happens on the first call (or when shape changes).
- H2D copies move into solution_compute and run every call, per the
  unified contract: "the harness passes full host-side inputs every call".
- cudaDeviceSynchronize() added before return.

Verified on small data (sm_89 / L20X). 13 of 15 pass exact/tol-bounded
correctness checks vs expected_output.txt:

| Task | GPU(ms) | CPU(ms) |
| --- | ---: | ---: |
| black_scholes              |    0.375 |    7.259 |
| dbscan                     |  203.010 |  239.577 |
| dtw_distance               |    0.316 | 1806.643 |
| monte_carlo                |    0.363 | 1332.731 |
| hausdorff_distance         |    0.102 |    1.359 |
| euclidean_distance_matrix  |    0.054 |    0.377 |
| held_karp_tsp              |    0.601 |   22.403 |
| spmv_csr                   |    0.107 |    0.033 |
| max_flow_push_relabel      | 1948.765 |   88.825 |
| sph_cell_index             |    0.409 |    6.978 |
| sph_position               |    0.351 |    0.430 |
| sph_forces                 |    0.304 |    4.562 |
| pdlp                       |    7.128 |    2.352 |

Two files (bonds_pricing, repo_pricing) have a separate, pre-existing
forward-declaration bug in their __device__ helper functions
(bondAccruedAmountGpu, cashFlowsNpvGpu, etc. are called before they are
declared). Their old-interface versions on main also fail to compile
with the same errors, so this is not a regression. The migration in
this commit is correct on its own; the forward-decl bug should be
addressed separately.