Data race in AdaptiveNonLocalMeansDenoisingImageFilter::ThreadedGenerateData (scatter write to m_RicianBiasImage)
itk::AdaptiveNonLocalMeansDenoisingImageFilter produces nondeterministic output under multi-threaded execution because ThreadedGenerateData scatter-writes a shared image at neighborhood-offset indices that cross work-unit boundaries.
Root cause
In Modules/Filtering/AdaptiveDenoising/include/itkAdaptiveNonLocalMeansDenoisingImageFilter.hxx, ThreadedGenerateData computes, for each center pixel, indices neighborhoodPatchIndex = centerIndex + neighborhoodPatchOffsetList[n] and writes the shared bias image at those neighbor locations:
IndexType neighborhoodPatchIndex = centerIndex + neighborhoodPatchOffsetList[n];
...
this->m_RicianBiasImage->SetPixel(neighborhoodPatchIndex, 0.0);
...
this->m_RicianBiasImage->SetPixel(neighborhoodPatchIndex, minimumDistance);
Because each work unit owns a contiguous output region but writes to centerIndex ± patchOffset, the written pixels fall into neighboring work units' regions. Multiple work units therefore perform unsynchronized SetPixel on the same shared m_RicianBiasImage pixels — a data race. The corrupted bias image then feeds the output in AfterThreadedGenerateData.
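The boundary overlap can be illustrated with a small model. This is a minimal sketch, not the filter's code: it assumes a 1-D image split into equal contiguous chunks, and `CountContestedPixels`, `numUnits`, and `radius` are hypothetical names introduced only for illustration. It counts the pixels that more than one work unit would scatter-write to.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1-D model of the scatter write (not ITK code).
// The image has `numPixels` pixels split into `numUnits` contiguous chunks.
// Every center pixel writes to center - radius .. center + radius.
// Returns how many pixels are written by more than one work unit.
std::size_t CountContestedPixels(int numPixels, int numUnits, int radius)
{
  assert(numPixels >= 0 && numUnits > 0 && radius >= 0);

  std::vector<int>  firstWriter(numPixels, -1); // -1 means "not written yet"
  std::vector<bool> contested(numPixels, false);
  const int chunk = (numPixels + numUnits - 1) / numUnits;

  for (int unit = 0; unit < numUnits; ++unit)
  {
    const int begin = std::min(numPixels, unit * chunk);
    const int end = std::min(numPixels, begin + chunk);
    for (int center = begin; center < end; ++center)
    {
      for (int offset = -radius; offset <= radius; ++offset)
      {
        const int target = center + offset;
        if (target < 0 || target >= numPixels)
        {
          continue; // outside the image
        }
        if (firstWriter[target] == -1)
        {
          firstWriter[target] = unit; // first unit to touch this pixel
        }
        else if (firstWriter[target] != unit)
        {
          contested[target] = true; // a second unit writes the same pixel
        }
      }
    }
  }
  return static_cast<std::size_t>(std::count(contested.begin(), contested.end(), true));
}
```

In this model, `CountContestedPixels(16, 4, 1)` returns 6 (two pixels at each of the three chunk boundaries), while a single work unit or a radius of 0 gives 0. This mirrors the observation that only the multi-work-unit path is affected.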
The race rarely manifests at low thread contention (the conflicting writes almost never interleave badly), so the test passes when run in isolation. Under CPU oversubscription, e.g. ctest -j8, where each of several concurrent tests spawns a full thread pool, preemption widens the race window and the output diverges from the baseline.
Reproduction
AdaptiveNonLocalMeansDenoisingImageFilterTest1 flakes under parallel load:
| Condition | Result |
|---|---|
| default work units, ctest -j8 | fails 50-100% of runs (ImageError != 0) |
| filter->SetNumberOfWorkUnits(1), ctest -j8 | passes 100% (6/6), deterministic |
| AdaptiveNonLocalMeansDenoisingImageFilterTest1 alone | passes |
The current Baseline/r16denoised_*.nrrd baselines match the single-threaded (race-free) result, confirming the multi-threaded path is the one that is wrong.
Suggested fix (proper)
The scatter write should be made race-free in the filter, e.g. one of:
- Reformulate as a gather: for each owned output pixel, read its contributing neighbors instead of scatter-writing neighbors.
- Accumulate into per-work-unit buffers and merge in AfterThreadedGenerateData.
- Restrict each work unit to write only pixels within its own output region.
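As an illustration of the second option, here is a minimal sketch of per-work-unit buffers with a deterministic merge. It is not the filter's code and makes several assumptions: the image is modeled as 1-D, `ValueFor` is a hypothetical stand-in for the value the filter computes, the written values do not depend on reads of the shared image, and all names (`WriteLog`, `RecordWrites`, `ScatterBuffered`) are invented here. To stay free of threading dependencies, the "threaded" phase is a plain loop run in reverse unit order, standing in for an arbitrary schedule. Each call touches only its own log, so the calls could be dispatched to separate threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One recorded write: (target pixel index, value). Hypothetical type.
using WriteLog = std::vector<std::pair<std::size_t, double>>;

// Hypothetical stand-in for the value the filter computes at a center pixel.
double ValueFor(std::size_t center)
{
  return static_cast<double>(center) + 1.0;
}

// Work of one unit over centers [begin, end): instead of touching the shared
// image, every scatter write is appended to the unit's private log.
void RecordWrites(std::size_t begin, std::size_t end, std::size_t numPixels,
                  std::size_t radius, WriteLog & log)
{
  for (std::size_t center = begin; center < end; ++center)
  {
    const std::size_t first = center >= radius ? center - radius : 0;
    const std::size_t last = std::min(numPixels - 1, center + radius);
    for (std::size_t target = first; target <= last; ++target)
    {
      log.emplace_back(target, ValueFor(center));
    }
  }
}

// Reference result: a plain serial scatter, last write wins.
std::vector<double> ScatterSerial(std::size_t numPixels, std::size_t radius)
{
  std::vector<double> image(numPixels, 0.0);
  for (std::size_t center = 0; center < numPixels; ++center)
  {
    const std::size_t first = center >= radius ? center - radius : 0;
    const std::size_t last = std::min(numPixels - 1, center + radius);
    for (std::size_t target = first; target <= last; ++target)
    {
      image[target] = ValueFor(center);
    }
  }
  return image;
}

std::vector<double> ScatterBuffered(std::size_t numPixels, std::size_t numUnits,
                                    std::size_t radius)
{
  assert(numUnits > 0);
  const std::size_t chunk = (numPixels + numUnits - 1) / numUnits;
  std::vector<WriteLog> logs(numUnits);

  // Stand-in for the threaded phase: units run in reverse order to mimic an
  // arbitrary schedule. Each call writes only to logs[unit].
  for (std::size_t unit = numUnits; unit-- > 0;)
  {
    const std::size_t begin = std::min(numPixels, unit * chunk);
    const std::size_t end = std::min(numPixels, begin + chunk);
    RecordWrites(begin, end, numPixels, radius, logs[unit]);
  }

  // Merge phase (single-threaded): replay the logs in work-unit order, so the
  // winning write for each pixel no longer depends on scheduling.
  std::vector<double> image(numPixels, 0.0);
  for (const WriteLog & log : logs)
  {
    for (const auto & write : log)
    {
      image[write.first] = write.second;
    }
  }
  return image;
}
```

Under these assumptions, `ScatterBuffered(n, units, r)` equals `ScatterSerial(n, r)` for any number of work units, which is the property the baselines rely on. The logs grow with the number of pixels times the patch size, so a real implementation might prefer per-unit image buffers or the gather formulation; which merge rule matches the filter's intended semantics would need to be confirmed against the actual code.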
Interim mitigation
A companion test-only change pins the test filter to a single work unit for determinism (SetNumberOfWorkUnits(1)), so CI stops flaking while the filter fix is pending. PR to follow; this issue tracks the underlying filter race.