Migrate 15 gpu_baseline.cu files to unified compute_only interface by b010001y · Pull Request #5 · Echoscd/AccelEval

b010001y · 2026-04-25T15:30:49Z

Problem

Commit a453c5b ("Migrate all 44 tasks to unified compute_only interface") migrated task_io.cu and cpu_reference.c for every task to a single solution_compute(N, all_inputs..., output) function, but the 15 gpu_baseline.cu files were not migrated. They still expose the old two-phase interface:

// OLD (broken under unified compute_only):
void solution_init(int N, /* all inputs */ ...);
void solution_compute(int N, T* output);

Result: each baseline links successfully (the LLM-facing names match), but at runtime solution_init is never invoked, so the device pointers are still null when solution_compute launches its kernel, and the output is garbage. Caught end-to-end on tasks/black_scholes: --validate reported 0.000 ms and all-zero output despite a clean compile.

Fix

For each of the 15 gpu_baseline.cu files, apply the same mechanical migration:

- Delete solution_init entirely.
- Define solution_free before solution_compute so the lazy-init block can call it.
- Change solution_compute to take the full set of inputs (signature copied verbatim from each task's task_io.cu).
- Move all cudaMalloc calls into solution_compute, guarded by a shape check so allocation happens only on the first call (or when the shape changes between calls).
- Move the H2D copies into solution_compute and run them on every call, per the unified contract: "the harness passes full host-side inputs every call".
- Add cudaDeviceSynchronize() before return (the contract requires a sync before solution_compute returns).

No kernels were touched. Only the host-side init/compute/free section at the bottom of each file changed.
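The steps above can be sketched on a toy task as follows. This is a minimal illustration, not code from any of the 15 files: the single-input signature, the kernel scale_kernel, and the globals d_in, d_out, and g_N are hypothetical, the output pointer is assumed to be host memory, and error checking is omitted for brevity. Real signatures come from each task's task_io.cu.

```cuda
#include <cuda_runtime.h>

// Hypothetical cached device state for a single-input task.
static float* d_in  = nullptr;  // device copy of the input
static float* d_out = nullptr;  // device output buffer
static int    g_N   = 0;        // shape the buffers were allocated for

// Hypothetical kernel standing in for a task's untouched kernels.
__global__ void scale_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Defined before solution_compute so the lazy-init block can call it.
// Null checks make repeated calls harmless.
void solution_free() {
    if (d_in)  { cudaFree(d_in);  d_in  = nullptr; }
    if (d_out) { cudaFree(d_out); d_out = nullptr; }
    g_N = 0;
}

void solution_compute(int N, const float* in, float* out) {
    if (N <= 0) return;  // nothing to do; avoids a zero-block launch

    // Lazy init: allocate on the first call or when the shape changes.
    if (d_in == nullptr || N != g_N) {
        solution_free();
        cudaMalloc(&d_in,  (size_t)N * sizeof(float));
        cudaMalloc(&d_out, (size_t)N * sizeof(float));
        g_N = N;
    }

    // H2D copy runs on every call because the harness passes
    // full host-side inputs each time.
    cudaMemcpy(d_in, in, (size_t)N * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (N + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(d_in, d_out, N);

    // Assumes `out` is a host buffer, so results are copied back here.
    cudaMemcpy(out, d_out, (size_t)N * sizeof(float), cudaMemcpyDeviceToHost);

    // Contract: synchronize before returning.
    cudaDeviceSynchronize();
}
```

Keeping the allocation behind the shape check means repeated timing calls with the same N pay only for the copies and the kernel, not for cudaMalloc.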

Verification

Ran framework/harness_gpu.cu + tasks/<t>/task_io.cu + tasks/<t>/gpu_baseline.cu against the small HuggingFace dataset (Cosmoscd/AccelEval) on sm_89 / NVIDIA L20X. Output diffed against expected_output.txt using each task's correctness.atol / rtol from task.json.
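For readers unfamiliar with atol / rtol checks, here is a host-only sketch of one common way to apply them. The rule |got - want| <= atol + rtol * |want| is an assumption borrowed from the usual allclose-style convention; the source does not state the exact formula the harness uses, and count_mismatches is a hypothetical helper.

```cuda
#include <cmath>
#include <cstdio>

// Counts elements outside tolerance and reports the largest absolute difference.
// Assumed rule (not confirmed by the source): |got - want| <= atol + rtol * |want|.
int count_mismatches(const float* got, const float* want, int n,
                     float atol, float rtol, float* max_abs_diff) {
    int bad = 0;
    float worst = 0.0f;
    for (int i = 0; i < n; ++i) {
        float diff = std::fabs(got[i] - want[i]);
        if (diff > worst) worst = diff;
        // Written as a negated <= so that a NaN difference counts as a mismatch.
        if (!(diff <= atol + rtol * std::fabs(want[i]))) ++bad;
    }
    *max_abs_diff = worst;
    return bad;
}

int main() {
    // Illustrative values only, not data from any task.
    const float want[3] = {1.0f, 2.0f, 3.0f};
    const float got[3]  = {1.005f, 2.0f, 3.5f};
    float worst = 0.0f;
    int bad = count_mismatches(got, want, 3, 0.01f, 0.0f, &worst);
    std::printf("%d mismatches, max_abs_diff=%g\n", bad, worst);
    return bad == 0 ? 0 : 1;
}
```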

13 of 15 files compile, run, and pass correctness:

| Task | Status | GPU(ms) | CPU(ms) | max_abs_diff |
| --- | --- | ---: | ---: | ---: |
| black_scholes | PASS | 0.375 | 7.259 | 1.0e-05 |
| dbscan | PASS | 203.010 | 239.577 | 0 |
| dtw_distance | PASS | 0.316 | 1806.643 | 1.0e-04 |
| monte_carlo | PASS | 0.363 | 1332.731 | 8.2e-03 |
| hausdorff_distance | PASS | 0.102 | 1.359 | 1.0e-05 |
| euclidean_distance_matrix | PASS | 0.054 | 0.377 | 2.0e-06 |
| held_karp_tsp | PASS | 0.601 | 22.403 | 0 |
| spmv_csr | PASS | 0.107 | 0.033 | 1.2e-06 |
| max_flow_push_relabel | PASS | 1948.765 | 88.825 | 0 |
| sph_cell_index | PASS | 0.409 | 6.978 | 0 |
| sph_position | PASS | 0.351 | 0.430 | 0 |
| sph_forces | PASS | 0.304 | 4.562 | 2.0e-04 |
| pdlp | PASS | 7.128 | 2.352 | 1.0e-06 |

Pre-existing issue (not caused by this PR)

bonds_pricing/gpu_baseline.cu and repo_pricing/gpu_baseline.cu fail to compile with errors like:

identifier "bondAccruedAmountGpu" is undefined
identifier "cashFlowsNpvGpu" is undefined
...

These __device__ helpers are defined further down in the same file but not forward-declared. The same errors occur on main before any of my changes — verified by git stash && nvcc ... && git stash pop. So this is pre-existing and unrelated to the migration. I included these two files in the PR anyway because the API change is correct and necessary; the forward-decl bug is independent and should be fixed separately.
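A forward declaration is the usual remedy for this class of error. The sketch below uses hypothetical names (accruedAmountSketch, price_kernel) and a made-up formula, since the real helper signatures in bonds_pricing and repo_pricing are not shown here. It illustrates only the ordering issue: a __device__ function must be declared before its first use, even if its definition comes later in the same file.

```cuda
#include <cuda_runtime.h>

// Forward declaration: makes the helper visible before its first use.
// Removing this line reproduces an "identifier ... is undefined" error.
__device__ float accruedAmountSketch(float rate, float days);

// Hypothetical kernel that calls the helper before its definition appears.
__global__ void price_kernel(const float* rate, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = accruedAmountSketch(rate[i], 30.0f);
}

// Definition placed further down the file, mirroring the layout described above.
// The formula is a placeholder, not the actual bond accrual computation.
__device__ float accruedAmountSketch(float rate, float days) {
    return rate * days / 360.0f;
}
```

Moving the helper definitions above their callers would work equally well. Which option suits the two files depends on how the helpers reference one another.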

Test plan

All 15 files were compiled with nvcc -O2 -arch=sm_89: 13 build cleanly, and 2 fail with the pre-existing forward-decl errors described above.
All 13 buildable baselines validate against expected_output.txt on the small HF dataset within their task.json tolerances.
solution_free is callable repeatedly (idempotent — guarded by null pointer checks).
Repeat on medium / large (not run here; should work given the migration is shape-agnostic).

Reproducer

python3 scripts/download_data.py small
/usr/local/cuda-12.4/bin/nvcc -O2 -arch=sm_89 -I framework/ \
    -diag-suppress=177 -diag-suppress=550 \
    framework/harness_gpu.cu \
    tasks/black_scholes/task_io.cu \
    tasks/black_scholes/gpu_baseline.cu \
    -o /tmp/bs_gpu
/tmp/bs_gpu tasks/black_scholes/data/small --validate
# main (broken): TIME_MS: 0.000, output.txt all zeros
# this PR:       TIME_MS: 0.375, 100000/100000 within atol=0.01

The repo migrated all task_io.cu adapters and cpu_reference.c files to a single solution_compute(N, all_inputs..., output) function in commit a453c5b ("Migrate all 44 tasks to unified compute_only interface"). However, the 15 gpu_baseline.cu files were not migrated and still expose the old solution_init + solution_compute(N, output_only) split. As a result they link successfully (the LLM-facing names match) but produce garbage at runtime: solution_init is never called, so device pointers stay null when solution_compute launches its kernel.

This commit migrates each gpu_baseline.cu to the new interface following the same pattern:

- Delete solution_init entirely.
- Define solution_free before solution_compute (so the lazy-init block can call it).
- solution_compute now takes the full set of inputs (signature copied verbatim from the task's task_io.cu).
- Move all cudaMalloc into solution_compute, guarded by a shape check so allocation only happens on the first call (or when shape changes).
- H2D copies move into solution_compute and run every call, per the unified contract: "the harness passes full host-side inputs every call".
- cudaDeviceSynchronize() added before return.

Verified on small data (sm_89 / L20X). 13 of 15 pass exact/tol-bounded correctness checks vs expected_output.txt:

| Task | GPU(ms) | CPU(ms) |
| --- | ---: | ---: |
| black_scholes | 0.375 | 7.259 |
| dbscan | 203.010 | 239.577 |
| dtw_distance | 0.316 | 1806.643 |
| monte_carlo | 0.363 | 1332.731 |
| hausdorff_distance | 0.102 | 1.359 |
| euclidean_distance_matrix | 0.054 | 0.377 |
| held_karp_tsp | 0.601 | 22.403 |
| spmv_csr | 0.107 | 0.033 |
| max_flow_push_relabel | 1948.765 | 88.825 |
| sph_cell_index | 0.409 | 6.978 |
| sph_position | 0.351 | 0.430 |
| sph_forces | 0.304 | 4.562 |
| pdlp | 7.128 | 2.352 |

Two files (bonds_pricing, repo_pricing) have a separate, pre-existing forward-declaration bug in their __device__ helper functions (bondAccruedAmountGpu, cashFlowsNpvGpu, etc. are called before they are declared). Their old-interface versions on main also fail to compile with the same errors, so this is not a regression. The migration in this commit is correct on its own; the forward-decl bug should be addressed separately.

b010001y wants to merge 1 commit into Echoscd:main from b010001y:fix/migrate-gpu-baselines-to-compute-only
