@prashanthjos prashanthjos commented Dec 24, 2025

Description

This PR adds observability for the websocket connection between the activator and autoscaler components. Operators currently have no easy way to tell when the autoscaler is unreachable from the activator, a condition that can lead to autoscaling failures.

Changes

New Metric

  • Name: kn.activator.autoscaler.reachable
  • Type: Int64Gauge
  • Values: 1 (reachable), 0 (not reachable)
  • Description: Whether the autoscaler is reachable from the activator

New Logging

  • ERROR level log when autoscaler is not reachable:
    • "Autoscaler is not reachable from activator. Stats were not sent." (on send failure)
    • "Autoscaler is not reachable from activator." (on connection check failure)

How It Works

The metric is recorded in two scenarios:

  1. Periodic check (every 5s):
    • Uses conn.Status() to check whether the connection is established
    • Catches: connection never established, DNS failures, autoscaler not running at startup

  2. On stat send:
    • Detects SendRaw() failures during actual stat message transmission
    • Catches: network timeouts, connection drops, autoscaler becoming unreachable
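
For readers who want to see the shape of the two recording paths, here is a minimal, self-contained sketch. It is not the code from this PR. The statusConn interface, reachableGauge type, fakeConn, and the checkOnce, monitor, and sendStat helpers are all hypothetical names introduced for illustration, the standard library logger stands in for the real logger, and recording 1 on a successful periodic check is an assumption (the description only says the check catches failures).

```go
package main

import (
	"context"
	"errors"
	"log"
	"sync/atomic"
	"time"
)

// statusConn is a hypothetical stand-in for the activator's websocket
// connection to the autoscaler. The method set is an assumption.
type statusConn interface {
	Status() error // nil once the connection is established
	SendRaw(messageType int, msg []byte) error
}

// reachableGauge is a hypothetical stand-in for the Int64Gauge metric:
// 1 means reachable, 0 means not reachable.
type reachableGauge struct{ v atomic.Int64 }

func (g *reachableGauge) record(reachable bool) {
	if reachable {
		g.v.Store(1)
		return
	}
	g.v.Store(0)
}

func (g *reachableGauge) value() int64 { return g.v.Load() }

// checkOnce is scenario 1: a single connection status check.
func checkOnce(conn statusConn, g *reachableGauge) {
	if err := conn.Status(); err != nil {
		log.Printf("ERROR Autoscaler is not reachable from activator. err=%v", err)
		g.record(false)
		return
	}
	g.record(true) // assumption: a healthy check also refreshes the gauge
}

// monitor runs checkOnce on every tick until ctx is cancelled.
func monitor(ctx context.Context, conn statusConn, g *reachableGauge, every time.Duration) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			checkOnce(conn, g)
		}
	}
}

// binaryMessage is the websocket binary frame type (RFC 6455 opcode 2).
const binaryMessage = 2

// sendStat is scenario 2: record the outcome of each stat send.
func sendStat(conn statusConn, g *reachableGauge, msg []byte) error {
	if err := conn.SendRaw(binaryMessage, msg); err != nil {
		log.Printf("ERROR Autoscaler is not reachable from activator. Stats were not sent. err=%v", err)
		g.record(false)
		return err
	}
	g.record(true)
	return nil
}

var errDown = errors.New("autoscaler down")

// fakeConn fails every call with err, or succeeds when err is nil.
type fakeConn struct{ err error }

func (f *fakeConn) Status() error             { return f.err }
func (f *fakeConn) SendRaw(int, []byte) error { return f.err }

func main() {
	g := &reachableGauge{}

	// Run the monitor briefly against a connection that never comes up.
	// The PR uses a 5s interval; 10ms keeps the demo short.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	monitor(ctx, &fakeConn{err: errDown}, g, 10*time.Millisecond)
	log.Printf("gauge after failed checks: %d", g.value())

	// A successful send flips the gauge back.
	_ = sendStat(&fakeConn{}, g, []byte("stat"))
	log.Printf("gauge after successful send: %d", g.value())
}
```

Keeping the single check in checkOnce, separate from the ticker loop, lets a unit test exercise both outcomes without waiting on real time.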

Testing

  • Unit tests pass (go test ./pkg/activator/... -v)
  • New test TestAutoscalerConnectionStatusMonitor added

knative-prow bot commented Dec 24, 2025

Welcome @prashanthjos! It looks like this is your first PR to knative/serving 🎉

knative-prow bot added the needs-ok-to-test and size/L labels Dec 24, 2025

knative-prow bot commented Dec 24, 2025

Hi @prashanthjos. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

knative-prow bot commented Dec 24, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: prashanthjos
Once this PR has been reviewed and has the lgtm label, please assign dsimansk for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

knative-prow bot requested review from dsimansk and skonto December 24, 2025 02:33

This change adds observability for the websocket connection between the
activator and autoscaler components:

- Add `activator_autoscaler_reachable` gauge metric (1=reachable, 0=not reachable)
- Log ERROR when autoscaler is not reachable during stat sending
- Add periodic connection status monitor (every 5s) to detect connection
  establishment failures
- Add unit tests for the new AutoscalerConnectionStatusMonitor function

The metric is recorded in two scenarios:
1. When SendRaw fails/succeeds during stat message sending
2. When the periodic status check detects connection not established

This helps operators identify connectivity issues between activator and
autoscaler that could impact autoscaling decisions.
@thiagomedina
Member

/ok-to-test

knative-prow bot added the ok-to-test label and removed the needs-ok-to-test label Dec 24, 2025

codecov bot commented Dec 24, 2025

Codecov Report

❌ Patch coverage is 83.33333% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.08%. Comparing base (a8803aa) to head (1e54a49).
⚠️ Report is 1 commits behind head on main.

Files with missing lines          Patch %   Lines
pkg/activator/stat_reporter.go    80.00%    4 Missing and 1 partial ⚠️
pkg/activator/metrics.go          88.23%    1 Missing and 1 partial ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16318      +/-   ##
==========================================
- Coverage   80.09%   80.08%   -0.02%     
==========================================
  Files         215      216       +1     
  Lines       13391    13429      +38     
==========================================
+ Hits        10725    10754      +29     
- Misses       2304     2312       +8     
- Partials      362      363       +1     


@prashanthjos
Author

/retest

@prashanthjos
Author

Related docs PR:

knative/docs#6548
