evan-cz commented Dec 18, 2025

Istio service mesh deployments create two challenges for the CloudZero Agent: (1) strict mTLS can conflict with the webhook's self-signed TLS certificates, and (2) cross-cluster load balancing can route metrics to the wrong aggregator, corrupting cost attribution data. This change adds automatic Istio detection, runtime validation, and comprehensive documentation.

Functional Change:

Before: Istio support required manual configuration via suppressIstioAnnotations. No runtime validation existed for cross-cluster load balancing risks. Documentation was incomplete and didn't explain the underlying problems.

After: The chart auto-detects Istio via CRD presence, applies port exclusion annotations automatically, and the validator detects both sidecar and ambient modes at runtime, warning when cluster ID configuration is incorrect or missing.
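The runtime mode detection described here could be sketched roughly as follows. This is an illustration, not the code in this PR: the type and function names, the `/server_info` path on the Envoy admin port, and the annotation value `enabled` are assumptions, as is reading annotations from a Downward API volume file in `key="value"` form.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// Mode is a hypothetical type describing how the pod participates in the mesh.
type Mode string

const (
	ModeNone    Mode = "none"
	ModeSidecar Mode = "sidecar"
	ModeAmbient Mode = "ambient"
)

// Annotation named in the PR description; the expected value is an assumption.
const ambientAnnotation = "ambient.istio.io/redirection"

// envoyAdminReachable reports whether an Envoy admin endpoint answers with 200.
// A sidecar, if present, listens on localhost:15000 inside the pod.
func envoyAdminReachable(url string) bool {
	client := &http.Client{Timeout: 500 * time.Millisecond}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// parseAnnotations parses the Downward API volume format: one key="value" per line.
func parseAnnotations(content string) map[string]string {
	out := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		key, raw, ok := strings.Cut(sc.Text(), "=")
		if !ok {
			continue
		}
		val, err := strconv.Unquote(raw)
		if err != nil {
			val = raw // keep the raw text if it is not a quoted string
		}
		out[key] = val
	}
	return out
}

// detectMode prefers sidecar evidence, then falls back to the ambient annotation.
func detectMode(sidecarReachable bool, annotations map[string]string) Mode {
	if sidecarReachable {
		return ModeSidecar
	}
	if annotations[ambientAnnotation] == "enabled" {
		return ModeAmbient
	}
	return ModeNone
}

func main() {
	// Assumed admin path; outside a mesh this simply returns false.
	sidecarUp := envoyAdminReachable("http://localhost:15000/server_info")
	ann := parseAnnotations("ambient.istio.io/redirection=\"enabled\"\napp=\"agent\"\n")
	fmt.Println(detectMode(sidecarUp, ann))
}
```

Keeping the probe result and the parsed annotations as plain inputs to `detectMode` means the decision logic can be unit tested without a running mesh.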

Solution:

  1. Added integrations.istio.enabled (null/true/false) for auto-detect or explicit control, replacing the old suppressIstioAnnotations approach

  2. Added integrations.istio.clusterID for traffic fencing in multicluster meshes, with fallback to clusterName for automatic configuration in sidecar mode

  3. Created new Istio diagnostic provider (app/domain/diagnostic/istio/) that:

    • Detects sidecar mode by querying localhost:15000 (Envoy admin API)
    • Detects ambient mode via Downward API (ambient.istio.io/redirection annotation)
    • In sidecar mode, validates effective cluster ID (clusterID || clusterName) matches what Istio reports, ensuring the DestinationRule will work correctly
    • Requires explicit clusterID in ambient mode (cannot validate without sidecar)
  4. Added Helm helpers for Istio detection and cluster ID resolution with detailed documentation explaining the port number differences (443 vs 8443)

  5. Completely rewrote helm/docs/istio.md to explain the two problems (mTLS conflict and cross-cluster LB) with Mermaid diagrams, solutions, and configuration guide
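The configuration semantics in items 1 through 3 can be summarized in a small sketch. It is written in Go for illustration only; the chart implements the enable and fallback logic in Helm helpers, and every name below (`IstioConfig`, `istioActive`, `effectiveClusterID`, `checkClusterID`) is hypothetical rather than taken from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// IstioConfig is a hypothetical mirror of the values discussed above.
type IstioConfig struct {
	Enabled     *bool  // integrations.istio.enabled: nil means auto-detect
	ClusterID   string // integrations.istio.clusterID
	ClusterName string // clusterName, used as the fallback
}

// istioActive honors an explicit true/false and otherwise follows CRD presence.
func istioActive(cfg IstioConfig, crdPresent bool) bool {
	if cfg.Enabled != nil {
		return *cfg.Enabled
	}
	return crdPresent
}

// effectiveClusterID implements the "clusterID || clusterName" fallback.
func effectiveClusterID(cfg IstioConfig) string {
	if cfg.ClusterID != "" {
		return cfg.ClusterID
	}
	return cfg.ClusterName
}

// checkClusterID returns a warning-style error when the configuration is
// incorrect or missing. reported is the cluster ID Istio reports, which is
// only available in sidecar mode.
func checkClusterID(mode string, cfg IstioConfig, reported string) error {
	switch mode {
	case "sidecar":
		if got := effectiveClusterID(cfg); got != reported {
			return fmt.Errorf("effective cluster ID %q does not match Istio cluster ID %q", got, reported)
		}
	case "ambient":
		// No sidecar to compare against, so an explicit value is required.
		if cfg.ClusterID == "" {
			return errors.New("ambient mode requires an explicit integrations.istio.clusterID")
		}
	}
	return nil
}

func main() {
	cfg := IstioConfig{ClusterName: "prod-east"}
	fmt.Println(istioActive(cfg, true))
	fmt.Println(checkClusterID("sidecar", cfg, "prod-east"))
	fmt.Println(checkClusterID("ambient", cfg, ""))
}
```

The ambient branch deliberately ignores the `clusterName` fallback, since without a sidecar there is nothing to validate the fallback value against.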

Validation:

  • Added comprehensive unit tests for Istio diagnostic provider (istio_test.go) covering sidecar mode, ambient mode, cluster ID validation, and edge cases
  • Added Helm unittest suite (istio_integration_test.yaml) with 33 tests covering DestinationRule/VirtualService generation, port exclusion annotations, API version selection, and cluster ID configuration
  • Added schema validation tests for integrations.istio.enabled and clusterID
  • Added Istio template test to verify full manifest generation with Istio enabled
  • Manual testing pending on Istio-enabled clusters (sidecar and ambient modes)

evan-cz requested a review from a team as a code owner December 18, 2025 14:06
evan-cz force-pushed the CP-32193 branch 2 times, most recently from bf147bd to b064029 December 18, 2025 14:30