This issue tracks performance improvements to, and investigations of, Python-to-C binding overhead, mostly driven by the cuTensorMapEncodeTiled benchmark devised in #659. That benchmark is useful because the function takes an unusually large number of arguments and therefore has unusually high Python-to-C overhead.
Comparison to a more limited Cython binding
As an interesting experimental datapoint, a colleague provided a vibe-coded Cython binding for cuTensorMapEncodeTiled that runs about 4x faster than the official cuda-bindings one. It is useful for seeing where some overheads may be reduced, but its raw performance should be read with care: this wrapper accepts far fewer kinds of input than cuda-bindings does, and it omits developer niceties such as enums.
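The trade-off between accepted input types and conversion cost can be sketched in pure Python. This is only an illustration: `DataType`, `convert_flexible`, and `convert_strict` are hypothetical names, not part of cuda-bindings or of the experimental Cython wrapper, and the real bindings do this work in Cython rather than in Python.

```python
import enum
import operator


class DataType(enum.IntEnum):
    # Hypothetical enum standing in for a CUDA enum type.
    UINT8 = 0
    FLOAT32 = 7


def convert_flexible(value):
    """Accept plain ints, enum members, or anything implementing __index__."""
    if type(value) is int:  # cheapest check first, for the common case
        return value
    if isinstance(value, enum.Enum):
        return int(value.value)
    return operator.index(value)  # raises TypeError for unsupported types


def convert_strict(value):
    """Accept plain ints only: fewer checks, but less convenient for callers."""
    if type(value) is not int:
        raise TypeError(f"expected int, got {type(value).__name__}")
    return value
```

The flexible version performs up to three checks per argument, and that cost is paid again for every argument of a many-argument function. The strict version does a single check but pushes the conversion burden onto the caller.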
Merged or in-progress fixes
Timings below are per iteration of the benchmark in #659. Each timing includes /both/ the binding overhead and some fixed amount of time spent in the actual CUDA call.
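A minimal sketch of how a per-iteration timing of this kind might be taken, and how the fixed CUDA cost could be separated from binding overhead. The helper names `per_call_ns` and `binding_overhead_ns` are hypothetical and are not taken from the benchmark in #659. The fixed cost of the underlying call is assumed to be measured separately, for example from C.

```python
import time


def per_call_ns(fn, iterations=100_000):
    """Return the mean wall-clock time of one call to fn, in nanoseconds.

    The Python loop itself adds a small constant cost to every iteration.
    """
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    return (time.perf_counter_ns() - start) / iterations


def binding_overhead_ns(total_ns, fixed_call_ns):
    """Estimate binding overhead by subtracting the fixed cost of the call.

    fixed_call_ns is an assumption supplied by the caller, not something
    this helper can measure.
    """
    return max(total_ns - fixed_call_ns, 0.0)
```

Because the fixed part does not shrink when the bindings get faster, a change that halves the binding overhead will show up as less than a 2x improvement in the per-iteration number.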
Under investigation
Items in this category are theoretical ways to reduce the number of operations required for type conversion. They have not necessarily been confirmed to have a measurable effect yet.
Deferred (effective, but high effort)
cdef classes can go through __new__ #1643
Rejected (ineffective)
FastEnums #1637