8373100: Genshen: Control thread can miss allocation failure notification#28665
8373100: Genshen: Control thread can miss allocation failure notification#28665earthling-amzn wants to merge 4 commits intoopenjdk:masterfrom
Conversation
|
👋 Welcome back wkemper! A progress list of the required criteria for merging this PR into |
|
@earthling-amzn This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been no new commits pushed to the ➡️ To integrate this PR with the above commit message to the |
|
@earthling-amzn The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
|
|
||
| // Notifies the control thread, but does not update the requested cause or generation. | ||
| // The overloaded variant should be used when the _control_lock is already held. | ||
| void notify_cancellation(GCCause::Cause cause); |
There was a problem hiding this comment.
These methods were the root cause here. ShenandoahHeap::_canceled_gc is read/written atomically, but ShenandoahGenerationalControlThread::_requested_gc_cause is read/written under a lock. These notify_cancellation methods did not update _requested_gc_cause at all. So, in the failure I observed we had:
- Control thread finishes cycle and sees no cancellation is requested (no lock used).
- Mutator thread fails allocation, cancels GC (again, no lock used), and does not change
_requested_gc_cause. - Control thread takes
_control_lockand checks_requested_gc_causeand sees_no_gc(becausenotify_cancellationdidn't change it) andwaitsforever now.
The fix here is to replace notify_cancellation with notify_control_thread which serializes updates to _requested_gc_cause under _control_lock.
There was a problem hiding this comment.
I was looking at the places where ShenandoahHeap::clear_cancelled_gc is called, I feel the problem is more likely from op_final_update_refs:
void ShenandoahConcurrentGC::op_final_update_refs() {
ShenandoahHeap* const heap = ShenandoahHeap::heap();
...
...
// Clear cancelled GC, if set. On cancellation path, the block before would handle
// everything.
if (heap->cancelled_gc()) {
heap->clear_cancelled_gc();
}
...
...
}
Let's say there is concurrent GC running, right before the final update refs safepoint, there is mutator allocation failure:
- The mutator tries to cancel the the concurrent GC and notify controller thread.
- The mutator block itself at
_alloc_failure_waiters_lock, claiming safepoint safe as well. - concurrent GC enter the final update refs (VM operation)
- in final update refs, VMThread sees cancelled_gc and clear it.
- concurrent GC finishes, but cancelled_gc has been cleared so it won't notify the mutator.
The fix seems to work in generational mode, but may not work in non-generational mode.
There was a problem hiding this comment.
While I was staring at the code ShenandoahController::handle_alloc_failure today, I found there is discrepancy between ShenandoahGenerationalControlThread and ShenandoahControlThread, I created a bug to unify the behavior, we could fix the issue in ShenandoahControlThread there.
There was a problem hiding this comment.
The scenario I described wasn't supposition, that is actually what happened in the debugger. The scenario you describe with op_final_update_refs would also be fixed by this PR. The _requested_gc_cause field should always be accessed under a lock. The code change here fixes an issue where an allocation failure might not set _requested_gc_cause at all.
There was a problem hiding this comment.
Yes, I understand the fix will solve the issue for genshen and also fix scenario I described.
I'll solve the potential issue in non-generational Shenandoah in the PR to fix the behavior differences in Genshen and non-generational Shenandoah.
kdnilsen
left a comment
There was a problem hiding this comment.
Thanks for diligent testing and analysis. Subtle code here.
pengxiaolong
left a comment
There was a problem hiding this comment.
Thanks for the digging and fixing the issue.
ysramakrishna
left a comment
There was a problem hiding this comment.
Thanks for cleaning this up.
Did you review the non-generational ShenandoahControlThread and uses thereof to make sure a similar issues doesn't exist there?
As Xiaolong states, it might be worthwhile to do a refactor that shares as much as needed and no more, and to do so cleanly.
This looks good; sorry for the delay in reviewing.
🚢
| void notify_control_thread(GCCause::Cause cause, ShenandoahGeneration* generation); | ||
| void notify_control_thread(MonitorLocker& ml, GCCause::Cause cause, ShenandoahGeneration* generation); | ||
|
|
||
| // Notifies the control thread, but does not update the requested cause or generation. | ||
| // The overloaded variant should be used when the _control_lock is already held. | ||
| void notify_cancellation(GCCause::Cause cause); | ||
| void notify_cancellation(MonitorLocker& ml, GCCause::Cause cause); | ||
| void notify_control_thread(GCCause::Cause cause); | ||
| void notify_control_thread(MonitorLocker& ml, GCCause::Cause cause); |
There was a problem hiding this comment.
Nit:
I'd (subjectively) order them thus: (nct = notify_control_thread)
- nct(cause)
- nct(ml, cause)
- nct(cause, generation)
- nct(ml, cause, generation)
For completeness in the documentation comment preceding, state that if an argument, cause or generation, is missing, it isn't updated.
I am assuming that there is a specific small subset of cause values where the generation isn't important to spell out and really implies "isn't necessary or is implicitly understood" for cancellation/request cause? Is there a call argument/consistency check that might be done in the nct:s where these bottom out to confirm this, or am I being unnecessarily paranoid?
There was a problem hiding this comment.
Yes, there are two uses where we don't need the generation:
- It's important to not update the generation for an allocation failure (degenerated cycle needs to use same generation)
- We are shutting down the JVM and don't want to start another cycle.
All cases need to pass a GCCause.
|
@ysramakrishna , @pengxiaolong - The non-generational control thread is less susceptible to this sort of issue because it has the responsibility of evaluating trigger conditions. It's loop therefore sleeps with a timed wait when the GC cycle is complete. If it misses a cancelled gc request, it will see it on the next iteration. |
|
/integrate |
|
Going to push as commit ea6493c.
Your commit was automatically rebased without conflicts. |
|
@earthling-amzn Pushed as commit ea6493c. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
In some cases, the control thread may fail to observe an allocation failure. This results in the thread which failed to allocate waiting forever for the control thread to run a cycle. Depending on which thread fails to allocate, the process may not make progress.
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28665/head:pull/28665$ git checkout pull/28665Update a local copy of the PR:
$ git checkout pull/28665$ git pull https://git.openjdk.org/jdk.git pull/28665/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 28665View PR using the GUI difftool:
$ git pr show -t 28665Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28665.diff
Using Webrev
Link to Webrev Comment