
Set gpu tpb #736

Open
otbrown wants to merge 63 commits into devel from set_gpu_tpb

Conversation

@otbrown
Collaborator

@otbrown otbrown commented Apr 24, 2026

Creating a facility for users to set the number of threads per block at runtime, for tuning the GPU implementation. NOTE: this only applies to kernels that are not handled by Thrust, which does its own thing. Resolves #735.

I considered and rejected the idea of creating a symmetric interface for the CPU for users who don't know OMP_NUM_THREADS or omp_set_num_threads() exist, but that's much riskier as the point of truth is external (in the OpenMP runtime).

TODO:

  • Should gpu_getNumThreadsPerBlock return a qindex? Probably.
  • Create a new home for user facing API, as environment doesn't really make sense. Kicking this back to next release.
  • Add a compile time default value -- that way expert maintainers can compile a tuned default into a library which is used on a system.
  • Query seemingly unused branch at
    if constexpr (NumTargs != -1) {
  • Add TPB to QuEST GPU environment reporting.
  • Add tests for new interface.
  • @JPRichings To check if this is really worthwhile, but please wait a week to tell me if it isn't.
    • It is!

@otbrown
Collaborator Author

otbrown commented Apr 24, 2026

Rudimentary testing done with:

#include <cstdio>
#include "quest.h"

int main (void)
{
  const int NQUBITS = 24;
  const int TPB = 32;


  initQuESTEnv();
  reportQuESTEnv();

  std::printf("Initial number of threads per block: %d\n", getQuESTGpuThreadsPerBlock());

  setQuESTGpuThreadsPerBlock(TPB);
  std::printf("New number of threads per block: %d\n", getQuESTGpuThreadsPerBlock());

  Qureg qureg = createForcedQureg(NQUBITS);

  std::printf("Initialising Qureg.\n");
  initPlusState(qureg);
  reportQureg(qureg);

  std::printf("Applying Quantum Fourier Transform.\n");
  applyFullQuantumFourierTransform(qureg, false);
  reportQureg(qureg);

  destroyQureg(qureg);
  finalizeQuESTEnv();

  return 0;
}

@otbrown otbrown self-assigned this Apr 24, 2026
@JPRichings
Contributor

Why would gpu_getNumThreadsPerBlock be a qindex? This is not a quantum quantity. uint should be fine (I am sure there is a recommendation from the CUDA API we can match).

@TysonRayJones
Member

Is there an advantage to users having to set this as a runtime hyperparameter? My (mostly undeveloped) belief is we can use occupancy tools (alluded to here) to automate this. I definitely shy away from placing a greater onus on users to optimise for their settings (like other prolific software packages), which the v4 overhaul was supposed to avoid (via e.g. the autodeployer).

Note too that the kernels so far are very primitive - each thread handles the updating of the minimum possible number of amplitudes (often just one!). I quite like that because it's very readable and simple (great for an open-source scientific project) but is an obvious site for optimisation.

Why would gpu_getNumThreadsPerBlock be a qindex? This is not a quantum quantity. uint should be fine (I am sure there is a recommendation from the CUDA API we can match).

It's true that it will never be anywhere as big as the quantities qindex is expected to store (like the number of basis states), but I have already used it in places where I thought an int might be insufficient. Inoffensive either way as uint or qindex imo

@JPRichings
Contributor

Hi Tyson,

I just noticed the value was fixed at 128 and had a feeling that it was large.

I just wanted a handle so I could write a benchmark so we can easily automate performance tuning ourselves.

I have not played with the occupancy tools but I should take a proper look as this might solve this automatically.

My other concern is that there are differences between Nvidia and AMD in optimal sizes due to hardware differences, so we might not be able to rely on the occupancy tuning in all cases unless it becomes available on all platforms.
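
The benchmark handle described above could drive a simple scan over candidate block sizes. The following is a minimal sketch, not code from this PR: `ScanResult`, `timeSeconds` and `scanThreadsPerBlock` are hypothetical names, and the `measure` callback is assumed to set the block size (for example via the setter this PR adds), run a workload, and return its cost in seconds.

```cpp
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical helper types for illustration; none of these names exist in QuEST.
struct ScanResult {
    int threadsPerBlock;  // best candidate found, or -1 if none were given
    double seconds;       // cost reported for that candidate
};

// Wall-clock time of a single call to work(), in seconds.
double timeSeconds(const std::function<void()>& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Calls measure(tpb) for every candidate and keeps the cheapest one.
ScanResult scanThreadsPerBlock(const std::vector<int>& candidates,
                               const std::function<double(int)>& measure) {
    ScanResult best{-1, 0.0};
    for (int tpb : candidates) {
        double cost = measure(tpb);
        if (best.threadsPerBlock < 0 || cost < best.seconds) {
            best = ScanResult{tpb, cost};
        }
    }
    return best;
}
```

A real `measure` callback would need to synchronise the GPU before stopping the clock, since kernel launches return before the kernel has finished.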

Comment thread quest/src/cpu/cpu_config.cpp
Comment thread quest/src/gpu/gpu_subroutines.cpp Outdated
@TysonRayJones
Member

I just noticed the value was fixed at 128 and had a feeling that it was large.

I guess it's very GPU specific! I think 128 was motivated by thinking of CC=3, which has a max active blocks per SM of 16, and a max active threads per SM of 2048. So using 128 threads per block perfectly maximally occupies the SMs (when there are enough amplitudes to admit more than 16 blocks per SM, of course!)

For illustration, the next smallest size is 96 (it must be a multiple of 32, else threads within a warp will be idle), which yields a number of active threads of 16 * 96 = 1536, which wastes 2048 - 1536 = 512 threads per SM!

Of course, newer GPUs support more active blocks per SM (even when the max active threads per SM is unchanged). E.g. CC=8 supports up to 32 active blocks per SM, so we could shrink to 64 threads per block while achieving the same occupancy - but I don't have a great intuition for the effect when we're memory-bandwidth bound.
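
The occupancy arithmetic above can be reproduced with a few lines of C++. This is a simplified sketch that only models the two limits mentioned in the comment (maximum resident blocks and maximum resident threads per SM); it ignores register and shared-memory pressure, and the function names are illustrative rather than part of QuEST or CUDA.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>

// Resident threads per SM for a given block size, limited by both the
// maximum number of resident blocks and the maximum number of resident threads.
int activeThreadsPerSM(int threadsPerBlock, int maxBlocksPerSM, int maxThreadsPerSM) {
    assert(threadsPerBlock > 0);
    int blocksByThreadLimit = maxThreadsPerSM / threadsPerBlock;
    int residentBlocks = std::min(maxBlocksPerSM, blocksByThreadLimit);
    return residentBlocks * threadsPerBlock;
}

// Prints active and idle thread counts for block sizes 32, 64, ..., 256.
void printOccupancyTable(int maxBlocksPerSM, int maxThreadsPerSM) {
    for (int tpb = 32; tpb <= 256; tpb += 32) {
        int active = activeThreadsPerSM(tpb, maxBlocksPerSM, maxThreadsPerSM);
        std::printf("tpb=%3d active=%4d idle=%4d\n", tpb, active, maxThreadsPerSM - active);
    }
}
```

With the figures quoted above, `activeThreadsPerSM(128, 16, 2048)` gives 2048 and `activeThreadsPerSM(96, 16, 2048)` gives 1536, while raising the block limit to 32 lets 64 threads per block reach 2048 as well.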

Certainly seems prudent to consult a CUDA runtime API, if that doesn't hurt our AMD compatibility!

@otbrown
Collaborator Author

otbrown commented Apr 27, 2026

Apologies, probably won't get to look at this again this week, but very happy to set this value programmatically if it can be done!

As it's architecture dependent, we definitely do need a way to adjust it, and ideally both at runtime and compile time. At compile time, so kindly HPC support teams can compile and maintain a tuned version, and at runtime, so they can scan through values without having to recompile in between. I'll have a chat with James about approaches later this week!

I 100% agree that we don't really want unknowing users messing around with this. I think something like an architecture.h or perftune.h or similar might be the best solution. A set of functionality that we explicitly document is for users who know what they are doing to tune the performance of the library for a specific architecture. It might be this is the only value in there for the time being, but for slingshot-11 reasons we need to add a parameter capping the total in-flight data and this would be a good spot for that too.

@TysonRayJones
Member

Fair enough - you've convinced me! Being able to adjust at runtime is of course extremely helpful during development of a user-friendlier adaptive system anyhow.

I like the sound of perftune.h - it could also go into debug.h in the interim, until there is more performance-tuning specific functionality.

Comment thread quest/src/api/environment.cpp Outdated
@otbrown
Collaborator Author

otbrown commented May 3, 2026

Should validate TPB is multiple of 32!
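
A check along these lines might look like the sketch below. It is an assumption-laden illustration, not the validation added in this PR: the function and constant names are hypothetical, the 1024 upper bound is the per-block thread limit on current CUDA architectures, and the multiple-of-64 test reflects the recommendation for AMD GPUs mentioned in the CMake docs in this PR, treated here as advice rather than an error.

```cpp
#include <string>

// Hypothetical constants for illustration; not taken from QuEST.
constexpr int WARP_SIZE = 32;               // threads per NVIDIA warp
constexpr int AMD_WAVEFRONT_SIZE = 64;      // recommended granularity on AMD
constexpr int MAX_THREADS_PER_BLOCK = 1024; // assumed hardware upper bound

// Returns an empty string when tpb is acceptable, else a reason for rejection.
std::string checkThreadsPerBlock(int tpb) {
    if (tpb <= 0) {
        return "threads per block must be positive";
    }
    if (tpb > MAX_THREADS_PER_BLOCK) {
        return "threads per block exceeds the assumed maximum of 1024";
    }
    if (tpb % WARP_SIZE != 0) {
        return "threads per block must be a multiple of 32";
    }
    return "";
}

// True when tpb also satisfies the (softer) recommendation for AMD GPUs.
bool isRecommendedForAmd(int tpb) {
    return tpb > 0 && tpb % AMD_WAVEFRONT_SIZE == 0;
}
```

Keeping the AMD rule as a separate predicate lets a caller warn rather than fail, which matches the "must" versus "should" wording used for the two constraints.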

Comment thread quest/src/api/environment.cpp Outdated
Comment thread quest/include/environment.h Outdated
Comment thread quest/include/environment.h Outdated
Comment thread quest/src/api/environment.cpp Outdated
Comment thread quest/src/api/environment.cpp Outdated
Comment thread quest/src/api/environment.cpp Outdated
Comment thread quest/src/gpu/gpu_config.cpp Outdated
Comment thread quest/src/gpu/gpu_config.cpp
Comment thread quest/src/gpu/gpu_kernels.cuh Outdated
Comment thread quest/src/gpu/gpu_subroutines.cpp Outdated
@otbrown
Collaborator Author

otbrown commented May 19, 2026

Rather than attempt to post a thumbs up on each comment while on train WiFi, I'll just comment thanks @TysonRayJones here! I'm hoping to give this branch some proper attention on Thursday/Friday.

@TysonRayJones
Member

No prob! They're almost all nits I can sweep through rapidly. You can flag any you disagree with or for which there's nuance, and I can address the remainder next time I'm on QuEST. (No need to co-author the squash with me for my minor and mostly stylistic changes!)

without triggering an internal error
replacing the original internal error. Note that all of the other MPI functions between comm_config.cpp and comm_subroutines.cpp are unguarded; we should create a macro around them
- added comm_isActive to indicate whether QuEST is using MPI (which is distinct from whether MPI itself is initialised)
- renamed comm_isInit to comm_isMpiInit, since it queries MPI directly/globally, and when true, does not indicate whether QuEST is actually using MPI
- record isMpiUserOwned within comm_config.cpp, since failed-validation must not kill user-owned MPI, and it must know user-ownership before QuESTEnv succeeds/records it (because validation can fail DURING QuESTEnv initialisation)
- explicitly divided (through doc) comm_config.cpp into things which query MPI globally, and things which query only QuEST's MPI env/communicator
as found by Codex! All hail our new overlords
@TysonRayJones
Member

I'll fix the validation issues, and move the API function; I think it doesn't belong in environment.h which is concerned with managing the QuESTEnv type. The most appropriate "long term" location is debug.cpp, which contains other controls of QuEST behaviour, such as validation precision, report lengths, etc. However, if we believe this facility/interface will be changed/extended in the future, we could presently move it into experimental.h. Recall that this is "nuisance free"; if we later decide to retain this API without change, moving it from experimental.h to debug.h has no change to UX. Is that all okay?

Member

@TysonRayJones TysonRayJones left a comment

(I'll address these in #767)

Comment thread docs/cmake.md Outdated
Comment thread docs/cmake.md Outdated
| `USER_SOURCE_NAMES` | (Undefined), String | The source file for a user program which will be compiled alongside QuEST. `USER_OUTPUT_EXE_NAME` *must* also be defined. |
| `USER_OUTPUT_EXE_NAME` | (Undefined), String | The name of the executable which will be created from the provided `USER_SOURCE_NAMES`. `USER_SOURCE_NAMES` *must* also be defined. |

| `QUEST_GPU_NUM_THREADS_PER_BLOCK` | (128), Number | The default number of threads per block QuEST will use when offloading to a GPU. *Must* be a multiple of 32. For AMD GPUs this *should* be a multiple of 64. |
Member

Further, what is the motivation for making this variable a CMake/compile-time constant? The fact it can be overridden at runtime means we don't gain any performance benefit. So we can improve the flexibility penalty-free by making this an environment variable instead, just like QUEST_DEFAULT_VALIDATION_EPSILON, so that changing it doesn't require recompilation.

Collaborator Author

See reply on #767 -- the flexibility is a convenience, the compilation is the important bit 😅

Comment thread quest/include/environment.h Outdated
Comment thread quest/include/environment.h Outdated
Comment thread quest/src/api/environment.cpp Outdated
{"isOmpCompiled", cpu_isOpenmpCompiled()},
{"isCuQuantumCompiled", gpu_isCuQuantumCompiled()},
{"isGpuCompiled", gpu_isGpuCompiled()},
{"isHipCompiled", gpu_isHipCompiled()},
Member

Great idea to introduce this, but I don't like that the binary is HIP-focused. We can replace isHipCompiled with "gpuPlatform", with options "CUDA" or "HIP"

Member

I think the binary of isHipCompiled would be useful to attach directly to QuESTEnv too, like James did for isMpiGpuAware

Comment thread CMakeLists.txt Outdated
Comment thread CMakeLists.txt Outdated
globalEnvPtr->isMultithreaded,
globalEnvPtr->isDistributed,
globalEnvPtr->userOwnsMpi,
numThreads,
Member

can print numGpuThreadsPerBlock here (and rename threads to cpuThreads)

Member

(holding off on this; everything else reported here is [at least intendedly] fixed over runtime. The semi-exception is the reported num-threads, but users cannot modify that via QuEST: instead they'd need to use the OMP runtime. we can return and add gpuNumThreads when we sort out making cpuNumThreads runtime variable)

Comment thread tests/unit/environment.cpp Outdated
Comment thread tests/unit/environment.cpp Outdated
Woof that's a lot of boilerplate - but at least we have the safest environment variables in the business! 😅
since it should be added in a separate PR with the other intendedly programmatically-accessible fields. I know in my heart of hearts that if I left isHipCompiled attached, the other fields would never follow hehe
which is now the lowest-priority default, overridden at executable launch via the environment variable, in-turn overridden at runtime using the setter
given it can now be a result of the cmake var, the env var, or the runtime setter argument
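
The precedence described in the commit notes above (compile-time default, overridden by an environment variable at launch, overridden in turn by the runtime setter) can be sketched as follows. This is an illustrative sketch rather than the PR's implementation: `EXAMPLE_DEFAULT_TPB`, `parsePositiveInt`, `ThreadsPerBlockSetting` and the environment variable name `EXAMPLE_GPU_TPB` are hypothetical stand-ins, and the fallback-on-bad-input behaviour is an assumption (the PR may instead raise a validation error).

```cpp
#include <climits>
#include <cstdlib>

// Lowest-priority default, which a build system could override with -D.
// The macro name is hypothetical; 128 is the default quoted in this PR.
#ifndef EXAMPLE_DEFAULT_TPB
#define EXAMPLE_DEFAULT_TPB 128
#endif

// Parses a positive decimal integer, returning fallback on any problem.
int parsePositiveInt(const char* text, int fallback) {
    if (text == nullptr || *text == '\0') {
        return fallback;
    }
    char* end = nullptr;
    long value = std::strtol(text, &end, 10);
    if (*end != '\0' || value <= 0 || value > INT_MAX) {
        return fallback;
    }
    return static_cast<int>(value);
}

// Holds the current value: compiled default < environment text < set().
class ThreadsPerBlockSetting {
public:
    // envText is typically std::getenv("EXAMPLE_GPU_TPB"), and may be null.
    explicit ThreadsPerBlockSetting(const char* envText)
        : value_(parsePositiveInt(envText, EXAMPLE_DEFAULT_TPB)) {}

    void set(int threadsPerBlock) { value_ = threadsPerBlock; }  // runtime override
    int get() const { return value_; }

private:
    int value_;
};
```

Passing the environment text into the constructor, instead of calling `std::getenv` inside it, keeps the precedence logic testable without modifying the process environment.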

3 participants