8373100: Genshen: Control thread can miss allocation failure notification by earthling-amzn · Pull Request #28665 · openjdk/jdk

earthling-amzn · 2025-12-04T20:35:42Z

In some cases, the control thread may fail to observe an allocation failure. This results in the thread which failed to allocate waiting forever for the control thread to run a cycle. Depending on which thread fails to allocate, the process may not make progress.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8373100: Genshen: Control thread can miss allocation failure notification (Bug - P4)

Reviewers

Kelvin Nilsen (@kdnilsen - Committer) Review applies to 1081f21e
Xiaolong Peng (@pengxiaolong - Committer) Review applies to 1081f21e
Y. Srinivas Ramakrishna (@ysramakrishna - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28665/head:pull/28665
$ git checkout pull/28665

Update a local copy of the PR:
$ git checkout pull/28665
$ git pull https://git.openjdk.org/jdk.git pull/28665/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28665

View PR using the GUI difftool:
$ git pr show -t 28665

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28665.diff

Using Webrev

Link to Webrev Comment

bridgekeeper · 2025-12-04T20:37:44Z

👋 Welcome back wkemper! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-12-04T20:38:10Z

@earthling-amzn This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8373100: Genshen: Control thread can miss allocation failure notification

Reviewed-by: ysr, kdnilsen, xpeng

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk · 2025-12-04T20:39:01Z

@earthling-amzn The following labels will be automatically applied to this pull request:

hotspot-gc
shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

mlbridge · 2025-12-04T20:42:27Z

Webrevs

kdnilsen

Thanks.

earthling-amzn · 2025-12-05T18:47:56Z

-
-  // Notifies the control thread, but does not update the requested cause or generation.
-  // The overloaded variant should be used when the _control_lock is already held.
-  void notify_cancellation(GCCause::Cause cause);


These methods were the root cause here. ShenandoahHeap::_canceled_gc is read/written atomically, but ShenandoahGenerationalControlThread::_requested_gc_cause is read/written under a lock. These notify_cancellation methods did not update _requested_gc_cause at all. So, in the failure I observed we had:

Control thread finishes cycle and sees no cancellation is requested (no lock used).

Mutator thread fails allocation, cancels GC (again, no lock used), and does not change _requested_gc_cause.

Control thread takes _control_lock and checks _requested_gc_cause and sees _no_gc (because notify_cancellation didn't change it) and waits forever now.

The fix here is to replace notify_cancellation with notify_control_thread which serializes updates to _requested_gc_cause under _control_lock.
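The pattern behind the fix can be sketched in isolation. This is a minimal model, not the HotSpot code: `ControlThreadModel`, `Cause`, and `wait_for_request` are hypothetical names, and `std::mutex` / `std::condition_variable` stand in for `_control_lock` and its monitor. The point it illustrates is that the requester writes the cause under the same lock the waiter uses to check it, so a request that arrives before the waiter blocks is still observed.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for GCCause::Cause; not the HotSpot enum.
enum class Cause { no_gc, allocation_failure };

class ControlThreadModel {
  std::mutex control_lock_;               // plays the role of _control_lock
  std::condition_variable cv_;
  Cause requested_cause_ = Cause::no_gc;  // plays the role of _requested_gc_cause

public:
  // Record the cause and wake the waiter while holding the lock.
  void notify_control_thread(Cause cause) {
    std::lock_guard<std::mutex> guard(control_lock_);
    requested_cause_ = cause;
    cv_.notify_all();
  }

  // Block until some cause has been requested, then consume it.
  // The predicate is evaluated under the lock, so a request made
  // before this call returns immediately instead of being lost.
  Cause wait_for_request() {
    std::unique_lock<std::mutex> lock(control_lock_);
    cv_.wait(lock, [this] { return requested_cause_ != Cause::no_gc; });
    Cause cause = requested_cause_;
    requested_cause_ = Cause::no_gc;
    return cause;
  }
};
```

In this model, calling `notify_control_thread(Cause::allocation_failure)` before `wait_for_request()` still returns `Cause::allocation_failure`. That ordering is the one that went unnoticed when only the atomic cancellation flag was set.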

I was looking at the places where ShenandoahHeap::clear_cancelled_gc is called, and I suspect the problem is more likely to come from op_final_update_refs:

void ShenandoahConcurrentGC::op_final_update_refs() {
  ShenandoahHeap* const heap = ShenandoahHeap::heap();
  ...
  ...
  // Clear cancelled GC, if set. On cancellation path, the block before would handle
  // everything.
  if (heap->cancelled_gc()) {
    heap->clear_cancelled_gc();
  }
  ...
  ...
}

Suppose a concurrent GC is running and a mutator allocation fails right before the final update refs safepoint:

The mutator tries to cancel the concurrent GC and notify the control thread.

The mutator blocks itself on _alloc_failure_waiters_lock, which also makes it safepoint safe.

The concurrent GC enters final update refs (a VM operation).

In final update refs, the VMThread sees cancelled_gc and clears it.

The concurrent GC finishes, but cancelled_gc has been cleared, so it won't notify the mutator.

The fix seems to work in generational mode, but may not work in non-generational mode.

While reading ShenandoahController::handle_alloc_failure today, I found a discrepancy between ShenandoahGenerationalControlThread and ShenandoahControlThread. I created a bug to unify the behavior; we could fix the issue in ShenandoahControlThread there.

The scenario I described wasn't supposition; it is what actually happened in the debugger. The scenario you describe with op_final_update_refs would also be fixed by this PR. The _requested_gc_cause field should always be accessed under a lock. The code change here fixes an issue where an allocation failure might not set _requested_gc_cause at all.

Yes, I understand the fix will solve the issue for Genshen and also fix the scenario I described.
I'll address the potential issue in non-generational Shenandoah in the PR that fixes the behavior differences between Genshen and non-generational Shenandoah.

kdnilsen

Thanks for diligent testing and analysis. Subtle code here.

pengxiaolong

Thanks for digging into this and fixing the issue.

ysramakrishna

Thanks for cleaning this up.

Did you review the non-generational ShenandoahControlThread and uses thereof to make sure a similar issue doesn't exist there?

As Xiaolong states, it might be worthwhile to do a refactor that shares as much as needed and no more, and to do so cleanly.

This looks good; sorry for the delay in reviewing.

🚢

ysramakrishna · 2025-12-12T23:57:50Z

  void notify_control_thread(GCCause::Cause cause, ShenandoahGeneration* generation);
  void notify_control_thread(MonitorLocker& ml, GCCause::Cause cause, ShenandoahGeneration* generation);
-
-  // Notifies the control thread, but does not update the requested cause or generation.
-  // The overloaded variant should be used when the _control_lock is already held.
-  void notify_cancellation(GCCause::Cause cause);
-  void notify_cancellation(MonitorLocker& ml, GCCause::Cause cause);
+  void notify_control_thread(GCCause::Cause cause);
+  void notify_control_thread(MonitorLocker& ml, GCCause::Cause cause);


Nit:

I'd (subjectively) order them thus: (nct = notify_control_thread)

nct(cause)

nct(ml, cause)

nct(cause, generation)

nct(ml, cause, generation)

For completeness in the documentation comment preceding, state that if an argument, cause or generation, is missing, it isn't updated.

I am assuming that there is a specific small subset of cause values where the generation isn't important to spell out and really implies "isn't necessary or is implicitly understood" for cancellation/request cause? Is there a call argument/consistency check that might be done in the nct:s where these bottom out to confirm this, or am I being unnecessarily paranoid?

Yes, there are two uses where we don't need the generation:

It's important to not update the generation for an allocation failure (degenerated cycle needs to use same generation)

We are shutting down the JVM and don't want to start another cycle.

All cases need to pass a GCCause.
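The overload semantics discussed here (an omitted argument is not updated) can be illustrated with a small model. Everything below is a hypothetical sketch: `RequestModel`, `Generation`, and the `Cause` values are invented for illustration and are not the HotSpot declarations. It assumes only what the thread states: every call passes a cause, and the cause-only variant leaves the previously requested generation in place.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-ins; not the HotSpot types.
enum class Cause { no_gc, concurrent_gc, allocation_failure, stop_vm };
struct Generation { const char* name; };

class RequestModel {
  std::mutex lock_;
  Cause cause_ = Cause::no_gc;
  const Generation* generation_ = nullptr;

public:
  // Cause-only variant: updates the cause and leaves the generation
  // untouched, so a degenerated cycle can reuse the same generation.
  void notify(Cause cause) {
    std::lock_guard<std::mutex> guard(lock_);
    cause_ = cause;
  }

  // Full variant: updates both the cause and the generation.
  void notify(Cause cause, const Generation* generation) {
    std::lock_guard<std::mutex> guard(lock_);
    cause_ = cause;
    generation_ = generation;
  }

  Cause cause() {
    std::lock_guard<std::mutex> guard(lock_);
    return cause_;
  }

  const Generation* generation() {
    std::lock_guard<std::mutex> guard(lock_);
    return generation_;
  }
};
```

For example, after `notify(Cause::concurrent_gc, &young)` followed by `notify(Cause::allocation_failure)`, the model reports the allocation failure as the cause while still pointing at `young`.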

earthling-amzn · 2025-12-13T00:16:06Z

@ysramakrishna, @pengxiaolong - The non-generational control thread is less susceptible to this sort of issue because it has the responsibility of evaluating trigger conditions. Its loop therefore sleeps with a timed wait when the GC cycle is complete. If it misses a cancelled GC request, it will see it on the next iteration.
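A rough sketch of why a timed wait tolerates a missed notification follows. The class `PollingControlModel` and its methods are hypothetical and heavily simplified; the hook parameter exists only to simulate, deterministically, a cancellation that lands between the flag check and the wait. It is not a rendering of ShenandoahControlThread.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

class PollingControlModel {
  std::atomic<bool> cancelled_{false};
  std::mutex lock_;
  std::condition_variable cv_;

public:
  // Sets the flag without taking the lock or notifying anyone,
  // which is the situation where a wakeup can be missed.
  void cancel_without_notify() { cancelled_.store(true); }

  // Returns the 1-based iteration on which the cancellation was seen,
  // or -1 if it was not seen within max_iterations.
  template <typename Hook>
  int iterations_until_seen(std::chrono::milliseconds interval,
                            int max_iterations,
                            Hook between_check_and_wait) {
    for (int i = 1; i <= max_iterations; ++i) {
      if (cancelled_.exchange(false)) {
        return i;
      }
      between_check_and_wait();  // simulated racing mutator
      std::unique_lock<std::mutex> lock(lock_);
      // The wait returns after `interval` even if nobody notifies,
      // so the loop re-checks the flag on the next iteration.
      cv_.wait_for(lock, interval);
    }
    return -1;
  }
};
```

If the hook calls `cancel_without_notify()`, the first iteration misses the request and the second one observes it. With an untimed wait the loop would instead stay blocked, which matches the hang described for the generational control thread.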

earthling-amzn · 2025-12-15T15:50:03Z

/integrate

openjdk · 2025-12-15T15:52:02Z

Going to push as commit ea6493c.
Since your change was applied there have been 20 commits pushed to the master branch:

34f2413: 8371503: RETAIN_IMAGE_AFTER_TEST do not work for some tests
1f47294: 8287062: com/sun/jndi/ldap/LdapPoolTimeoutTest.java failed due to different timeout message
f5187eb: 8373599: Cleanup arguments.hpp includes
... and 17 more: https://git.openjdk.org/jdk/compare/23c39757ecdc834c631f98f4487cfea21c9b948b...master

Your commit was automatically rebased without conflicts.

openjdk · 2025-12-15T15:52:10Z

@earthling-amzn Pushed as commit ea6493c.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Expand scope of control lock so that it can't miss cancellation notif…

89af170

…ications

openjdk Bot added hotspot-gc hotspot-gc-dev@openjdk.org shenandoah shenandoah-dev@openjdk.org labels Dec 4, 2025

openjdk Bot added the rfr Pull request is ready for review label Dec 4, 2025

kdnilsen approved these changes Dec 4, 2025

earthling-amzn marked this pull request as draft December 5, 2025 15:41

openjdk Bot removed the rfr Pull request is ready for review label Dec 5, 2025

Set requested gc cause under a lock when allocation fails

1081f21

earthling-amzn commented Dec 5, 2025

earthling-amzn marked this pull request as ready for review December 5, 2025 18:49

openjdk Bot added the rfr Pull request is ready for review label Dec 5, 2025

kdnilsen approved these changes Dec 5, 2025

pengxiaolong approved these changes Dec 11, 2025

ysramakrishna approved these changes Dec 13, 2025

openjdk Bot added the ready Pull request is ready to be integrated label Dec 13, 2025

earthling-amzn added 2 commits December 12, 2025 16:33

Improve comment

baed458

Merge remote-tracking branch 'jdk/master' into fix-missed-cancellation

4c82d21

ysramakrishna approved these changes Dec 13, 2025

openjdk Bot added the integrated Pull request has been integrated label Dec 15, 2025

openjdk Bot closed this Dec 15, 2025

openjdk Bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Dec 15, 2025
