Adding rocAL+rocJPEG decode performance harness #474
Pull request overview
This PR introduces a manual performance harness for measuring rocAL image decode throughput (rocJPEG vs TurboJPEG) in multi-GPU sharded workloads, and updates rocAL’s rocJPEG path to optionally split decode work across multiple dedicated rocJPEG decoder instances.
Changes:
- Add a new manual benchmark folder (`tests/cpp_api/rocjpeg_decode_perf/`) with C++/Python runners and reporting scripts for repeatable on/off comparisons.
- Enhance rocAL’s rocJPEG decode implementation to optionally shard a batch across up to 4 dedicated rocJPEG decoder instances using OpenMP.
- Update `dataloader_multithread` to support configurable CPU thread count and an “effective batch size” for rocJPEG split-path benchmarking.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/cpp_api/rocjpeg_decode_perf/run_tests_twice_solution_on_off.sh | Driver script to run C++/Python benchmarks with rocJPEG split on/off plus TurboJPEG baselines. |
| tests/cpp_api/rocjpeg_decode_perf/rocal_decode_call_bench.py | Python rocAL decode benchmark with optional multi-process shard execution and timing extraction. |
| tests/cpp_api/rocjpeg_decode_perf/reporting_test_results.sh | Parses rocAL C++/Python logs and summarizes per-shard decode times and computed speedups. |
| tests/cpp_api/rocjpeg_decode_perf/reporting_perf_sharded_results.sh | Parses sharded jpegdecodeperf logs and summarizes per-GPU decode results. |
| tests/cpp_api/rocjpeg_decode_perf/README.md | Documents benchmark purpose, required env, workflows, and log/report generation. |
| tests/cpp_api/rocjpeg_decode_perf/perf_sharded_launcher.cpp | C++ helper to shard a dataset via symlinks and launch jpegdecodeperf per GPU with logs. |
| tests/cpp_api/dataloader_multithread/dataloader_multithread.cpp | Adds CPU thread-count arg and adjusts effective batch sizing/output handling for rocJPEG split benchmarking. |
| rocAL/source/loaders/image/image_read_and_decode.cpp | Implements optional rocJPEG dedicated OpenMP split path with multiple decoder instances and per-shard decode. |
| rocAL/include/loaders/image/image_read_and_decode.h | Adds state for multiple rocJPEG decoders, sub-batch sizes, and split toggle flag. |
Harden the sharded launcher work directory cleanup, make symlink names collision-resistant, validate dataloader CPU thread count, improve Python decoded image accounting, and use spawn for multi-shard benchmark workers.
| std::strcmp(rocjpeg_omp_split_env, "FALSE") == 0 || | ||
| std::strcmp(rocjpeg_omp_split_env, "false") == 0)); | ||
| if (_use_rocjpeg_dedicated_omp_split) { | ||
| const size_t rocjpeg_decoder_count = std::min(static_cast<size_t>(batch_size), std::max(static_cast<size_t>(1), std::min(static_cast<size_t>(4), _num_threads))); |
What if the user sets _num_threads to less than 4?
This is already bounded by _num_threads. The decoder count is computed as min(batch_size, max(1, min(4, _num_threads))), so if the user configures fewer than 4 CPU threads, rocAL creates only that many rocJPEG decoder instances. For example, _num_threads=2 creates 2 decoder instances; _num_threads=1 creates 1. The hard cap of 4 only applies when the user provides 4 or more CPU threads.
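The bound described in this reply can be checked in isolation. The following is a minimal sketch, assuming a hypothetical free function `rocjpeg_decoder_count` (the PR computes the value inline in the loader, not through a helper with this name):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper mirroring the expression quoted in the reply:
// min(batch_size, max(1, min(4, num_threads))).
std::size_t rocjpeg_decoder_count(std::size_t batch_size, std::size_t num_threads) {
    constexpr std::size_t kMaxDecoders = 4;  // hard cap used by the split path
    // Never fewer than one decoder, never more than the cap or the thread count.
    const std::size_t by_threads =
        std::max<std::size_t>(1, std::min(kMaxDecoders, num_threads));
    // Never more decoders than images in the batch.
    return std::min(batch_size, by_threads);
}
```

With a batch of 64, two CPU threads give two decoders, one thread gives one, and eight threads are capped at four. A batch of 3 with eight threads gives three decoders, because the batch size is the tighter bound.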
| const size_t base_sub_batch = static_cast<size_t>(batch_size) / rocjpeg_decoder_count; | ||
| const size_t sub_batch_remainder = static_cast<size_t>(batch_size) % rocjpeg_decoder_count; | ||
| for (size_t decoder_index = 0; decoder_index < rocjpeg_decoder_count; decoder_index++) { | ||
| const size_t sub_batch_size = base_sub_batch + ((decoder_index < sub_batch_remainder) ? 1 : 0); |
This can make the sub-batch size an odd number. Please check whether this is OK for the hardware decoder.
Odd sub-batch sizes should be OK. The existing single-decoder rocJPEG path already passes the user-provided rocAL batch size directly to rocJPEG, so odd batch sizes were already possible before this new split. In the new split path, each rocJPEG decoder owns its own streams and is initialized with its own sub-batch size. The split logic only divides the same total batch across decoder instances; it does not introduce a new odd-size requirement that did not already exist.
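To make the resulting sizes concrete, here is a small sketch of the base-plus-remainder split from the diff above. The function name `split_sub_batches` is hypothetical and only illustrates the arithmetic; it is not part of the PR:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: divides batch_size across decoder_count decoders.
// The first (batch_size % decoder_count) decoders each receive one extra image.
std::vector<std::size_t> split_sub_batches(std::size_t batch_size,
                                           std::size_t decoder_count) {
    std::vector<std::size_t> sizes;
    if (decoder_count == 0) return sizes;  // nothing to split across
    const std::size_t base = batch_size / decoder_count;
    const std::size_t remainder = batch_size % decoder_count;
    sizes.reserve(decoder_count);
    for (std::size_t i = 0; i < decoder_count; ++i) {
        sizes.push_back(base + (i < remainder ? 1 : 0));
    }
    return sizes;
}
```

A batch of 10 across 4 decoders yields sub-batches of 3, 3, 2, and 2, so odd sizes do occur, while a batch of 64 across 4 decoders yields 16 each. The sizes always sum to the original batch size.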
| _rocjpeg_decoder = create_decoder(decoder_config); | ||
| _rocjpeg_decoder->initialize(device_id, batch_size); | ||
| const char *rocjpeg_omp_split_env = std::getenv("ROCAL_ROCJPEG_DEDICATED_OMP_SPLIT"); | ||
| _use_rocjpeg_dedicated_omp_split = !(rocjpeg_omp_split_env && |
Suggest adding a helper function for this, since the logic is duplicated in multiple files. Eventually we need to get rid of this, so it is OK to check for just 0/1 for simplicity; there is no need for OFF, False, etc. Make the function case-insensitive, like below:
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

// Returns true only when the variable is set to "0" or "no" (case-insensitive).
static bool env_flag_disabled(const char* name) {
    const char* val = std::getenv(name);
    if (!val || val[0] == '\0') return false;  // unset or empty: not disabled
    std::string s(val);
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s == "0" || s == "no";
}
Addressed, in commit: eb7750c. I added a small case-insensitive env_flag_disabled() helper and replaced the duplicated env parsing in both the rocAL loader path and the dataloader benchmark. The helper keeps the split path enabled by default and disables it only for 0 or no, matching the benchmark scripts' 0/1 usage.
| #pragma omp parallel for num_threads(rocjpeg_decoder_threads) | ||
| for (size_t shard = 0; shard < _rocjpeg_decoders.size(); shard++) { | ||
| #if ENABLE_HIP | ||
| hipError_t hip_status = hipSetDevice(_device_id); |
There is a duplicate hipSetDevice on line #370. Why is this required here?
This is intentional. The outer hipSetDevice sets the device for the load routine thread, but the split path runs rocJPEG work inside OpenMP worker threads. HIP current device is thread-local, so each worker needs to set the device before invoking rocJPEG/HIP-backed work. I added a comment to clarify this at line number 395 in commit code: eb7750c.
| @@ -350,53 +374,132 @@ ImageReadAndDecode::load(unsigned char *buff, | |||
| _set_device_id = true; | |||
Addressed in commit: eb7750c. I replaced the if (!A && !B) ... else if (A || B) condition with a named is_rocjpeg_decoder boolean and a plain if/else. This keeps the same behavior but makes the two decode paths clearer: non-rocJPEG decoders use the existing per-image path, and rocJPEG/rocJPEG cropped use the batched rocJPEG path.
| if (hip_status != hipSuccess) { | ||
| THROW("hipSetDevice failed inside rocJPEG shard worker"); | ||
| } | ||
| #endif |
Suggest putting all the decode_info calls for a shard within the while loop. I think the current logic substitutes a failed image with another image from the batch. This should be done at the end of decoding, by padding or something similar. Otherwise we are sending the same frame to rocJPEG to decode twice.
Addressed in this commit: c0526b7.
I refactored the rocJPEG split-path decode-info logic so the original candidate and fallback candidates are handled in a single while loop per shard item. This keeps the same fallback/substitution behavior but avoids having one decode_info call outside the retry loop and another inside it.
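The shape of that refactor can be sketched as follows. This is an illustrative stand-in rather than the code from the commit: `pick_decodable` and the `header_ok` callback are hypothetical names, with `header_ok` playing the role of the `decode_info` call, and the fallback order (walking down from the end of the shard) is an assumption based on the diff fragment in this thread.

```cpp
#include <cstddef>
#include <functional>
#include <optional>

// Hypothetical sketch: choose which image to decode for shard item `i`.
// The item itself is tried first; on failure, fallback candidates are taken
// from the end of the shard [shard_begin, shard_end) moving downward.
// header_ok(k) stands in for a decode_info call on image k.
std::optional<std::size_t> pick_decodable(
        std::size_t i, std::size_t shard_begin, std::size_t shard_end,
        const std::function<bool(std::size_t)>& header_ok) {
    std::size_t candidate = i;
    std::size_t next_fallback = shard_end;  // one past the next fallback index
    // Single probe site: the original item and every fallback go through it.
    while (!header_ok(candidate)) {
        do {
            if (next_fallback == shard_begin) return std::nullopt;  // exhausted
            candidate = --next_fallback;
        } while (candidate == i);  // do not re-probe the item that just failed
    }
    return candidate;
}
```

As in the PR, a successful fallback means another image from the same shard is decoded in place of the failed one, so the substitution behavior raised in the review is preserved rather than replaced by padding.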
Latest Code Change Test Summary
Test Configuration
AMD-SMI 26.2.2+671d39a71e
Dataset: ImageNet 5 Classes
Decoded Image Count
C++ rocAL Sample Decode-Time Results
Python rocAL Benchmark Decode-Time Results
rocAL Patch Solution Enhancement
|
| int j = static_cast<int>(shard_end) - 1; | ||
| while (j >= static_cast<int>(shard_begin)) { | ||
| if (rocjpeg_decoder->decode_info(_compressed_buff[j].data(), _actual_read_size[j], &original_width, &original_height, | ||
| &decoded_width, &decoded_height, |
Throwing inside an OpenMP parallel region can produce unwanted results. You probably need to capture the return code and throw the exception after all threads are done.
Addressed in this commit: c0526b7.
I removed the THROW calls from inside the OpenMP worker loop. Worker failures are now recorded through a shared error flag/message protected by an OpenMP critical section, and the main thread throws once after the parallel region completes.
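A minimal sketch of that pattern, assuming a hypothetical `run_shards` wrapper and a `decode_shard` callback that returns false on failure (the PR uses its own flag and message variables inside the loader):

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical sketch: workers record failures, the caller throws afterwards.
void run_shards(std::size_t shard_count,
                const std::function<bool(std::size_t)>& decode_shard) {
    bool failed = false;        // shared error flag
    std::string first_error;    // message from the first recorded failure

    #pragma omp parallel for
    for (std::size_t shard = 0; shard < shard_count; ++shard) {
        bool ok = false;
        try {
            ok = decode_shard(shard);
        } catch (...) {
            ok = false;  // no exception may escape the parallel region
        }
        if (!ok) {
            // Serialize access to the shared flag and message.
            #pragma omp critical(shard_error)
            {
                if (!failed) {
                    failed = true;
                    first_error = "decode failed in shard " + std::to_string(shard);
                }
            }
        }
    }

    // Back on the calling thread, after every worker has finished.
    if (failed) throw std::runtime_error(first_error);
}
```

When several shards fail in the same batch, which one is recorded first depends on thread scheduling, so callers should treat the message as diagnostic only. Without OpenMP enabled the pragmas are ignored and the loop runs serially with the same result.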
| for multi-GPU sharded decode experiments, comparing the rocAL + rocJPEG path | ||
| with the dedicated OpenMP split enabled and disabled. | ||
|
|
||
| Suggested location in rocAL: |
I don't understand what all these tests and scripts are used for. Let's discuss.
rrawther left a comment
Please address the review comments.
The Added Test Script Files
These files are not intended to be regular correctness/unit tests. They are a manual performance harness for validating the rocJPEG split-decoder change in rocAL.
The rocAL code change affects how rocJPEG decode work is scheduled internally: instead of using one rocJPEG decoder instance for the full batch, the split path can use multiple dedicated rocJPEG decoder instances and divide the batch across them. To validate that type of change, we need more than a normal pass/fail test; we need a repeatable way to compare decode timing with the split path ON and OFF across C++ and Python rocAL entry points.
The scripts are organized as follows:
This folder gives developers a reproducible workflow for answering:
This is why the folder is under |
Motivation
This PR adds a rocAL-focused performance harness for validating and measuring rocJPEG-backed image decode behavior, especially for multi-GPU sharded decode workloads.
The main goals are:
- `jpegdecodeperf` across GPU shards, making it easier to compare rocAL decode results against rocJPEG sample-level performance.
- `dataloader_multithread` test app so it can drive the new rocJPEG split-path benchmarking with configurable CPU thread count and effective batch sizing.
Technical Details
This PR adds the main rocAL decode enhancement being measured by the new harness: rocJPEG decode work can now be split across multiple dedicated rocJPEG decoder instances instead of sending the whole batch through one shared rocJPEG decoder.
The rocJPEG dedicated OpenMP split path is enabled by default. With the default behavior, rocAL creates up to four rocJPEG decoder instances, bounded by the configured CPU thread count and batch size. The input batch is divided into per-decoder sub-batches, and OpenMP dispatches those sub-batches across the dedicated decoder workers. This allows the benchmark configuration of four CPU threads to use four rocJPEG decoder instances, reducing contention around one decoder and improving decode throughput for sharded/multi-GPU image loading workloads.
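The dispatch described above can be sketched end to end. The names below (`ShardRange`, `make_shard_ranges`, `tag_items_by_shard`) are hypothetical and the per-shard work is reduced to tagging each image with the index of the decoder that owns it; the real path calls into rocJPEG instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct ShardRange {
    std::size_t begin;  // first image index owned by the decoder
    std::size_t end;    // one past the last image index
};

// Hypothetical helper: contiguous [begin, end) range per decoder instance.
std::vector<ShardRange> make_shard_ranges(std::size_t batch_size,
                                          std::size_t num_threads) {
    const std::size_t decoders = std::min(
        batch_size,
        std::max<std::size_t>(1, std::min<std::size_t>(4, num_threads)));
    std::vector<ShardRange> ranges;
    if (decoders == 0) return ranges;  // empty batch
    std::size_t begin = 0;
    for (std::size_t d = 0; d < decoders; ++d) {
        const std::size_t size =
            batch_size / decoders + (d < batch_size % decoders ? 1 : 0);
        ranges.push_back({begin, begin + size});
        begin += size;
    }
    return ranges;
}

// Each worker writes only to its own slice, so no locking is needed here.
std::vector<int> tag_items_by_shard(std::size_t batch_size,
                                    std::size_t num_threads) {
    const std::vector<ShardRange> ranges =
        make_shard_ranges(batch_size, num_threads);
    std::vector<int> owner(batch_size, -1);
    #pragma omp parallel for
    for (std::size_t s = 0; s < ranges.size(); ++s) {
        for (std::size_t i = ranges[s].begin; i < ranges[s].end; ++i) {
            owner[i] = static_cast<int>(s);
        }
    }
    return owner;
}
```

For a batch of 10 with four CPU threads, the ranges are [0, 3), [3, 6), [6, 8), and [8, 10), and every image ends up owned by exactly one decoder.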
To compare against the previous behavior, set `ROCAL_ROCJPEG_DEDICATED_OMP_SPLIT=0`. In that mode, rocAL keeps the previous single-decoder rocJPEG path, so the benchmark scripts can compare the old and new behavior directly.
This PR adds `tests/cpp_api/rocjpeg_decode_perf/` as a manual performance harness for the rocJPEG split-decoder change. These scripts are not regular CTest unit tests; they are intended for explicit developer/reviewer runs on systems with a suitable dataset and GPU configuration.
The harness is needed because the change is performance-sensitive. A correctness-only test would not show whether splitting rocJPEG work across multiple decoder instances improves decode time or whether ON/OFF behavior remains comparable.
The harness provides:
- `dataloader_multithread`.
- `fn.readers.file` and `fn.decoders.image`.
- `ROCAL_ROCJPEG_DEDICATED_OMP_SPLIT=0` and `=1`.
- `jpegdecodeperf` launcher for rocJPEG sample-level comparison.
This PR adds a new manual benchmark/support folder:
New files:
The new `README.md` documents the benchmark purpose, required environment variables, common workflow, log locations, and example commands for PR reviewers or developers running the tests manually.
`run_tests_twice_solution_on_off.sh` is the main rocAL comparison driver. It runs six benchmark cases:
The script uses:
to toggle the dedicated split path for rocJPEG runs. It writes logs under configurable `LOG_DIR`, defaulting to:
The script requires only the machine-specific inputs to be exported:
and supports optional:
`reporting_test_results.sh` parses the logs produced by `run_tests_twice_solution_on_off.sh`. It summarizes:
`rocal_decode_call_bench.py` is a Python rocAL decode benchmark. It builds a simple `readers.file` + `decoders.image` pipeline and supports:
- `pipe.timing_info()`
`perf_sharded_launcher.cpp` is a standalone helper launcher for rocJPEG `jpegdecodeperf`. It:
- `.jpg` and `.jpeg` files
- `jpegdecodeperf` process per GPU
- `jpegdecodeperf_gpu<N>.log`
`reporting_perf_sharded_results.sh` parses the logs produced by `perf_sharded_launcher.cpp`. It reports:
Changes in:
include:
- `_use_rocjpeg_dedicated_omp_split`, controlled by `ROCAL_ROCJPEG_DEDICATED_OMP_SPLIT`.
This PR also updates:
to support this benchmark path by:
- `cpu_thread_count` argument
- `cpu_thread_count` into `rocalCreate`
- `ROCAL_ROCJPEG_DEDICATED_OMP_SPLIT`
Test Plan
Lightweight validation was run after applying the changes:
Manual benchmark workflow documented in the README:
Standalone `jpegdecodeperf` workflow documented in the README:
Test Result
The following local checks passed:
- `reporting_perf_sharded_results.sh`
- `reporting_test_results.sh`
- `run_tests_twice_solution_on_off.sh`
- `perf_sharded_launcher.cpp` compiled successfully with `g++ -std=c++17 -Wall -Wextra -pedantic`.
- `rocal_decode_call_bench.py` passed Python bytecode compilation with `python3 -m py_compile`.
The benchmark harness was added specifically to report the effect of this change by comparing:
against:
for both C++ and Python rocAL decode paths.
No full hardware benchmark results are included in this PR note because the added harness is intended to support manual performance validation on systems with the target ROCm/rocAL/rocJPEG installation, dataset, and GPU configuration.