Skip to content

[3.5] server: fix isLearner metric not set after snapshot#21487

Open
mango766 wants to merge 1 commit intoetcd-io:release-3.5from
mango766:fix/learner-metric-after-snapshot
Open

[3.5] server: fix isLearner metric not set after snapshot#21487
mango766 wants to merge 1 commit intoetcd-io:release-3.5from
mango766:fix/learner-metric-after-snapshot

Conversation

@mango766
Copy link
Copy Markdown

Summary

Backport of the fix for #17175 to release-3.5. The original fix (#17176) was merged to main but the previous backport attempt (#17266) was closed without merging.

After a learner node receives a snapshot (or restarts after one), the etcd_server_is_learner Prometheus metric stays at 0 instead of being set to 1. This happens because the conf change entries that originally set the metric have been compacted into the snapshot, so applyConfChange is never called for them.

Changes

  • Set the isLearner metric right after Recover() in NewServer() (server startup/restart path)
  • Set the isLearner metric right after Recover() in applySnapshot() (snapshot receive path)

This is a simpler approach than the main fix — instead of moving the metric into the membership package, we just update it at the two call sites in server.go. Both of the reporter's reproduction cases (learner joining after snapshot, learner restarting after snapshot) are covered.

Fixes #17175

After a learner receives a snapshot or restarts from a snapshot, the
etcd_server_is_learner Prometheus metric stays at 0 because the conf
change log entries that originally set it have been compacted. The
metric was only updated in applyConfChange, which is never called for
entries already folded into a snapshot.

Fix this by setting the isLearner metric right after Recover() in both
the server startup path (NewServer) and the snapshot apply path
(applySnapshot). This ensures the metric reflects the member's actual
learner status regardless of how the cluster state was loaded.

Fixes etcd-io#17175

Signed-off-by: mango766 <mango766@users.noreply.github.com>

Co-Authored-By: Claude (claude-opus-4-6) <noreply@anthropic.com>
Signed-off-by: easonysliu <easonysliu@tencent.com>
@k8s-ci-robot
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mango766
Once this PR has been reviewed and has the lgtm label, please assign wenjiaswe for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Copy Markdown

Hi @mango766. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

2 participants