Migrate 15 gpu_baseline.cu files to unified compute_only interface#5
Open
b010001y wants to merge 1 commit into
Open
Migrate 15 gpu_baseline.cu files to unified compute_only interface#5b010001y wants to merge 1 commit into
b010001y wants to merge 1 commit into
Conversation
The repo migrated all task_io.cu adapters and cpu_reference.c files to a single solution_compute(N, all_inputs..., output) function in commit a453c5b ("Migrate all 44 tasks to unified compute_only interface"). However, the 15 gpu_baseline.cu files were not migrated and still expose the old solution_init + solution_compute(N, output_only) split. As a result they link successfully (the LLM-facing names match) but produce garbage at runtime: solution_init is never called, so device pointers stay null when solution_compute launches its kernel. This commit migrates each gpu_baseline.cu to the new interface following the same pattern: - Delete solution_init entirely. - Define solution_free before solution_compute (so the lazy-init block can call it). - solution_compute now takes the full set of inputs (signature copied verbatim from the task's task_io.cu). - Move all cudaMalloc into solution_compute, guarded by a shape check so allocation only happens on the first call (or when shape changes). - H2D copies move into solution_compute and run every call, per the unified contract: "the harness passes full host-side inputs every call". - cudaDeviceSynchronize() added before return. Verified on small data (sm_89 / L20X). 13 of 15 pass exact/tol-bounded correctness checks vs expected_output.txt: | Task | GPU(ms) | CPU(ms) | | --- | ---: | ---: | | black_scholes | 0.375 | 7.259 | | dbscan | 203.010 | 239.577 | | dtw_distance | 0.316 | 1806.643 | | monte_carlo | 0.363 | 1332.731 | | hausdorff_distance | 0.102 | 1.359 | | euclidean_distance_matrix | 0.054 | 0.377 | | held_karp_tsp | 0.601 | 22.403 | | spmv_csr | 0.107 | 0.033 | | max_flow_push_relabel | 1948.765 | 88.825 | | sph_cell_index | 0.409 | 6.978 | | sph_position | 0.351 | 0.430 | | sph_forces | 0.304 | 4.562 | | pdlp | 7.128 | 2.352 | Two files (bonds_pricing, repo_pricing) have a separate, pre-existing forward-declaration bug in their __device__ helper functions (bondAccruedAmountGpu, cashFlowsNpvGpu, etc. are called before they are declared). Their old-interface versions on main also fail to compile with the same errors, so this is not a regression. The migration in this commit is correct on its own; the forward-decl bug should be addressed separately.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Commit a453c5b ("Migrate all 44 tasks to unified compute_only interface") migrated
task_io.cuandcpu_reference.cfor every task to a singlesolution_compute(N, all_inputs..., output)function, but the 15gpu_baseline.cufiles were not migrated. They still expose the old two-phase interface:Result: each baseline links successfully (the LLM-facing names match), but at runtime
solution_initis never invoked, so device pointers stay null whensolution_computelaunches its kernel. Outputs are garbage. Caught end-to-end ontasks/black_scholes—--validatereported 0.000 ms / all-zero output despite a clean compile.Fix
For each of the 15
gpu_baseline.cufiles, apply the same mechanical migration:solution_initentirely.solution_freebeforesolution_computeso the lazy-init block can call it.solution_computetakes the full set of inputs (signature copied verbatim from each task'stask_io.cu).cudaMallocintosolution_compute, guarded by a shape check so allocation only happens on the first call (or when shape changes between calls).solution_computeand run every call, per the unified contract: "the harness passes full host-side inputs every call".cudaDeviceSynchronize()added before return (the contract requires sync beforesolution_computereturns).No kernels were touched. Only the host-side init/compute/free section at the bottom of each file changed.
Verification
Ran
framework/harness_gpu.cu + tasks/<t>/task_io.cu + tasks/<t>/gpu_baseline.cuagainst thesmallHuggingFace dataset (Cosmoscd/AccelEval) on sm_89 / NVIDIA L20X. Output diffed againstexpected_output.txtusing each task'scorrectness.atol/rtolfromtask.json.13 of 15 files compile, run, and pass correctness:
Pre-existing issue (not caused by this PR)
bonds_pricing/gpu_baseline.cuandrepo_pricing/gpu_baseline.cufail to compile with errors like:These
__device__helpers are defined further down in the same file but not forward-declared. The same errors occur onmainbefore any of my changes — verified bygit stash && nvcc ... && git stash pop. So this is pre-existing and unrelated to the migration. I included these two files in the PR anyway because the API change is correct and necessary; the forward-decl bug is independent and should be fixed separately.Test plan
nvcc -O2 -arch=sm_89(13 cleanly, 2 hit the pre-existing forward-decl bug above).expected_output.txton thesmallHF dataset within theirtask.jsontolerances.solution_freeis callable repeatedly (idempotent — guarded by null pointer checks).medium/large(not run here; should work given the migration is shape-agnostic).Reproducer
python3 scripts/download_data.py small /usr/local/cuda-12.4/bin/nvcc -O2 -arch=sm_89 -I framework/ \ -diag-suppress=177 -diag-suppress=550 \ framework/harness_gpu.cu \ tasks/black_scholes/task_io.cu \ tasks/black_scholes/gpu_baseline.cu \ -o /tmp/bs_gpu /tmp/bs_gpu tasks/black_scholes/data/small --validate # main (broken): TIME_MS: 0.000, output.txt all zeros # this PR: TIME_MS: 0.375, 100000/100000 within atol=0.01