[Unitrace] toggling PTI_ENABLE_COLLECTION after warmup aborts in cl_collector.h assertion #108

@NinaWie

Description

unitrace crashes when I enable collection only after the warmup kernels have run. I'm profiling an OpenCL kernel and want to use --start-paused so that only the main trials are profiled.
This is the command I used: PTI_ENABLE_COLLECTION=0 unitrace --opencl --chrome-kernel-logging --chrome-itt-logging --start-paused --metric-sampling --output-dir-path ./profiler_output -o profiler_output/trace python task.py

Working version:
It works fine if I enable profiling for all trials:
os.environ["PTI_ENABLE_COLLECTION"] = "1"
<warmup trials>
<real trials>
os.environ["PTI_ENABLE_COLLECTION"] = "0"

Version with error:
However, I would like to start profiling only after the warmup trials:
<warmup trials>
os.environ["PTI_ENABLE_COLLECTION"] = "1"
<real trials>
os.environ["PTI_ENABLE_COLLECTION"] = "0"
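For reference, the failing pattern above can be written as a small self-contained sketch. The names `collection_enabled` and `run_trials` are hypothetical helpers introduced only for illustration, and `kernel` stands in for whatever callable launches the OpenCL kernel. The sketch only toggles the environment variable in the same way as the snippets above; it is not a workaround for the assertion.

```python
import os
from contextlib import contextmanager

ENV_VAR = "PTI_ENABLE_COLLECTION"


@contextmanager
def collection_enabled():
    """Set PTI_ENABLE_COLLECTION to "1" inside the block and to "0" afterwards."""
    os.environ[ENV_VAR] = "1"
    try:
        yield
    finally:
        # Pause collection again, even if a trial raises.
        os.environ[ENV_VAR] = "0"


def run_trials(kernel, n_warmup, n_trials):
    """Run warmup trials with collection untouched, then the real trials with it enabled."""
    for _ in range(n_warmup):
        kernel()  # warmup trials: not meant to be profiled
    with collection_enabled():
        for _ in range(n_trials):
            kernel()  # real trials: meant to be profiled
```

Calling `run_trials(launch_kernel, n_warmup=5, n_trials=20)` under the unitrace command above, with `launch_kernel` as a placeholder for the actual benchmark body, should correspond to the "version with error" case.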

This results in the following error:

Error:
task.py::TestVectorAddOCL::test_benchmark1 python: /home/nwiedema/pti-gpu/tools/unitrace/src/opencl/cl_collector.h:1322: static void ClCollector::OnExitEnqueueKernel(cl_callback_data*, ClCollector*, uint64_t*) [with T = _cl_params_clEnqueueNDRangeKernel; cl_callback_data = _cl_callback_data; uint64_t = long unsigned int]: Assertion `*(params->event) != nullptr' failed.
Fatal Python error: Aborted

I only see this issue with OpenCL; with SYCL (without the --opencl flag) it works fine. I also tried the Temporal or Out-of-Application Control approach (with --session) and got the same error.

Environment:
latest unitrace version (built from main)
GPU: PTL (also reproducible on LNL)
OS: Linux

Thanks in advance!
