Conversation


@artem-lunarg artem-lunarg commented Jan 16, 2026

Instead, remove substates when the object is actually destructed (in the destructor).

This matches the pre-substate behavior, where a state object could retain some data after Destroy (for example, syncval uses this for error messages).

In the post-substate version, Destroy removes registered substates. In the case of syncval, this results in losing resource handle information during submit-time validation. Originally, the syncval command buffer state object stored the list of handles, which was referenced in case of an error (even if the command buffer had been destroyed).

Closes #11490

Instead, remove substates when the object is actually destructed
(in the destructor).

This matches the pre-substate behavior, where a state object could
retain some data after Destroy (for example, syncval uses this for
error messages). This assumes that there is a shared_ptr reference
that keeps the object alive.

In the post-substate version, Destroy removes registered substates,
and the above scenario does not work. In the case of syncval, this
results in losing resource handle information during submit-time
validation. Originally, the syncval command buffer state object stored
the list of handles, and QueueBatchContext held a shared_ptr to submitted
command buffers.
@ci-tester-lunarg (Collaborator)

CI Vulkan-ValidationLayers build queued with queue ID 624744.

@artem-lunarg (Contributor, Author)

Still trying to come up with a test that reproduces the issue, but it's not that easy... (tested directly on the app)

@ci-tester-lunarg (Collaborator)

CI Vulkan-ValidationLayers build # 22191 running.

@ci-tester-lunarg (Collaborator)

CI Vulkan-ValidationLayers build # 22191 passed.


artem-lunarg commented Jan 16, 2026

I definitely want to reproduce this with a test. I understand that the error happens due to non-synchronized writes from two queues (at least that's how VVL detects it), but the crash happens because the command buffer is deleted (the deletion looks correct; there is no in-use error). To delete a command buffer you have to wait for it, and waiting for the command buffer prevents the hazard... I have a reproducible gfxr capture, so I can figure this out sooner or later.

for (auto &item : sub_states_) {
    item.second->Destroy();
}
sub_states_.clear();
Contributor:
What about in Pipeline::Destroy()?

Contributor Author:
Please clarify

Contributor Author:
So currently we have a use case with command buffers; whether other state objects need this probably depends on their logic (command buffers were referenced via shared_ptr and used after deletion). This results in a hard crash, so if other objects have such scenarios it will be hard to miss.

Contributor:

Why do we now not clear here, but only for pipeline?

I can't tell if this is an issue due to just how Command Buffers (and queues) are the only state we have that we don't derive from CoreChecks.

Is it because these are dispatchable handles?

If we are clearing it in Pipelines, but not here, I'm curious what the "rule" to follow is.

Contributor Author:

Ideally we would have logic that doesn't need this, but syncval relied on this before, so this just restores the old behavior (the current behavior is a regression introduced by substates).


@artem-lunarg artem-lunarg Jan 16, 2026


> Why do we now not clear here, but only for pipeline?

Ah, I see, indeed. Not sure why we cleared only for command buffers and pipelines and not for others.

@artem-lunarg (Contributor, Author)

Reproduced the original crash with a test.

Now I need to figure out whether the scenario from the test is valid behavior (even after this fix). If it is valid behavior, then an additional false-positive issue needs to be fixed (possibly as a separate PR).


MennoVink commented Jan 20, 2026

vkQueueSubmit(): WRITE_RACING_WRITE hazard detected. vkCmdPipelineBarrier2KHR[Advect] (from VkCommandBuffer 0x2395bbd1720 submitted on the current VkQueue 0x2394c2385d0[Graphics Queue 0]) writes to VkImage 0x7820000000782, which was previously written during an image layout transition initiated by another vkCmdPipelineBarrier2KHR command (from VkCommandBuffer 0x23956d9f830 submitted on VkQueue 0x2394eb82320[Transfer Queue 0]). 
The current synchronization allows VK_ACCESS_2_TRANSFER_READ_BIT accesses at VK_PIPELINE_STAGE_2_COPY_BIT|VK_PIPELINE_STAGE_2_RESOLVE_BIT|VK_PIPELINE_STAGE_2_BLIT_BIT|VK_PIPELINE_STAGE_2_CONVERT_COOPERATIVE_VECTOR_MATRIX_BIT_NV, VK_ACCESS_2_INDIRECT_COMMAND_READ_BIT accesses at VK_PIPELINE_STAGE_2_COPY_INDIRECT_BIT_KHR, but the layout transition does not synchronize with these accesses.
Vulkan insight: If the layout transition is done via an image barrier, ensure srcStageMask and srcAccessMask synchronize with the accesses mentioned above. If the transition occurs as part of the render pass begin operation, consider specifying an external subpass dependency (VK_SUBPASS_EXTERNAL) with srcStageMask and srcAccessMask that synchronize with those accesses, or perform the transition in a separate image barrier before the render pass begins.
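The insight above amounts to widening the source scope of the transition barrier so it covers the prior transfer reads. A minimal sketch of such an image barrier (fragment only; image, layouts, and subresource range are placeholder assumptions, and it requires Vulkan 1.3 / VK_KHR_synchronization2 headers):

```cpp
#include <vulkan/vulkan.h>

// Fragment only: builds a layout-transition barrier whose source scope covers
// the copy/resolve/blit reads named in the report. Values are placeholders.
VkImageMemoryBarrier2 MakeTransitionBarrier(VkImage image) {
    VkImageMemoryBarrier2 barrier{};
    barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2;
    // Source scope: synchronize with the prior transfer-read accesses.
    barrier.srcStageMask = VK_PIPELINE_STAGE_2_COPY_BIT |
                           VK_PIPELINE_STAGE_2_RESOLVE_BIT |
                           VK_PIPELINE_STAGE_2_BLIT_BIT;
    barrier.srcAccessMask = VK_ACCESS_2_TRANSFER_READ_BIT;
    // Destination scope: the write that follows the transition (assumed).
    barrier.dstStageMask = VK_PIPELINE_STAGE_2_ALL_TRANSFER_BIT;
    barrier.dstAccessMask = VK_ACCESS_2_TRANSFER_WRITE_BIT;
    barrier.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;
    barrier.newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image = image;
    barrier.subresourceRange = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1};
    return barrier;
}
```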

This is from a different example.

Is this the non-synchronized-writes issue you're talking about? Could this be because QFOT is not supported by syncval (#8112)?
It seems to validate against the transfer queue's release barrier. At the very least there is a semaphore, then the compute/graphics queue's acquire barrier, then a fence in between.
Perhaps this is just a textual error (it sees the release barrier on the transfer queue first and doesn't override it with the release barrier on the graphics queue). If so, then the question becomes whether we can still get WAW errors when we've fenced on the command buffer that waited on a semaphore from the command buffer that did the previous write.

The interesting part here is that my acquire barrier uses AllCommands + MemoryRead. I can imagine this could be specified somewhere as invalid, because I'm going to be writing to the image even though it hasn't been made ready for writing yet.


artem-lunarg commented Jan 20, 2026

@MennoVink I found the root cause of the issue; it's in syncval internals (we also have an internal test that reproduces the same behavior as the app). Now we need to fix one part of the validation. The crash itself is not the root cause (i.e., the missing null check at that point). Once the false positive is fixed, the crash will go away and there will be no false-positive error message either. I'm trying to come up with a solution this week.

P.S. It's not about QFOT; it's mostly about the internal mechanism for tracking accesses from different queues in general.



Development

Successfully merging this pull request may close these issues.

1.4.335.0 syncval crash

5 participants