Skip to content

SLING-13169 - Fix JobManager readiness condition to preserve topology on transient probe failures#54

Open
daniancu wants to merge 2 commits into
apache:masterfrom
daniancu:feature/SKYOPS-137756-fix-jobmanager-readiness-condition
Open

SLING-13169 - Fix JobManager readiness condition to preserve topology on transient probe failures#54
daniancu wants to merge 2 commits into
apache:masterfrom
daniancu:feature/SKYOPS-137756-fix-jobmanager-readiness-condition

Conversation

@daniancu
Copy link
Copy Markdown

@daniancu daniancu commented Apr 8, 2026

Summary

  • Remove stopProcessing() from unbindJobProcessingEnabledCondition() so topology state survives readiness condition removal — the core fix enabling recovery after transient probe blips
  • Update notifyListeners() and addListener() to send combined state (caps != null && isJobProcessingEnabled()) ensuring all listeners — including JobSchedulerImpl — correctly stop when readiness condition is absent
  • Add 3 new unit tests covering condition toggle with topology preservation, addListener() combined-state contract, and rapid toggle stress

Context

When the readiness condition is removed (e.g. during a transient K8S probe failure), unbindJobProcessingEnabledCondition() called stopProcessing() which destroyed topologyCapabilities. When the condition returned, notifyListeners() sent active=false (because topology was null) and job processing never resumed — requiring an unrelated topology event to recover.

Additionally, notifyListeners() and addListener() only sent caps != null to listeners, ignoring the readiness condition state. This meant JobSchedulerImpl would continue scheduling jobs even when the readiness condition was absent.

Changes

  1. unbindJobProcessingEnabledCondition(): Remove stopProcessing() call — preserve topology state so recovery is immediate when the condition returns
  2. notifyListeners(): Send caps != null && isJobProcessingEnabled() instead of just caps != null — all listeners see the correct combined state
  3. addListener(): Same combined-state logic for the initial callback on subscription — ensures freshly registered listeners don't become active when the condition is absent

Test plan

  • All 112 existing unit tests pass (0 failures, 0 errors)
  • New: testConditionTogglePreservesTopology — verifies topology survives unbind/rebind and listeners get correct active state
  • New: testAddListenerSendsCombinedState — verifies addListener() sends false when topology exists but condition is absent (JobSchedulerImpl startup path)
  • New: testRapidConditionToggle — verifies topology and state consistency after rapid toggles

@joerghoh
Copy link
Copy Markdown
Contributor

@daniancu can you please check the Sonar summary at https://sonarcloud.io/summary/new_code?id=apache_sling-org-apache-sling-event&pullRequest=54 and see if we can improve the rating? Thanks!

@joerghoh
Copy link
Copy Markdown
Contributor

@daniancu is there already an issue for this at https://issues.apache.org/jira?

Comment on lines +635 to +636
final TopologyCapabilities caps = this.topologyCapabilities;
final boolean active = caps != null && isJobProcessingEnabled();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
final TopologyCapabilities caps = this.topologyCapabilities;
final boolean active = caps != null && isJobProcessingEnabled();
final boolean active = this.topologyCapabilities != null && isJobProcessingEnabled();

I would collapse these 2 statements into one.

@joerghoh
Copy link
Copy Markdown
Contributor

I agree that right now the semantics of topologyCapabilities is not really explicit, and it having a value of null has many side effects. With this PR we also should improve it's documentation and explain what it now means when it's null. Don't have a good idea yet how to phrase it.

@daniancu daniancu changed the title Fix JobManager readiness condition to preserve topology on transient probe failures SLING-13169 - Fix JobManager readiness condition to preserve topology on transient probe failures Apr 14, 2026
@daniancu daniancu force-pushed the feature/SKYOPS-137756-fix-jobmanager-readiness-condition branch 2 times, most recently from 48001b9 to 92cbf76 Compare April 14, 2026 09:24
…pology on transient probe failures

Remove stopProcessing() from unbindJobProcessingEnabledCondition() so that
topology state survives readiness condition removal. Update notifyListeners()
and addListener() to send combined state (topology AND readiness) ensuring
all listeners — including JobSchedulerImpl — correctly stop when the
readiness condition is absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Ai-Assisted-By: claude
@daniancu daniancu force-pushed the feature/SKYOPS-137756-fix-jobmanager-readiness-condition branch from 92cbf76 to 44d02d3 Compare April 14, 2026 09:35
…pology on transient probe failures

Remove stopProcessing() from unbindJobProcessingEnabledCondition() so that
topology state survives readiness condition removal. Update notifyListeners()
and addListener() to send combined state (topology AND readiness) ensuring
all listeners — including JobSchedulerImpl — correctly stop when the
readiness condition is absent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Ai-Assisted-By: claude
@sonarqubecloud
Copy link
Copy Markdown

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants