
Fix telegraf startup failure when no prometheus configmap is mounted #1640

Closed
suyadav1 wants to merge 11 commits into ci_prod from suyadav/fix-telegraf-prom-defaults

Conversation

@suyadav1
Contributor

@suyadav1 suyadav1 commented Apr 10, 2026

When no container-azm-ms-agentconfig configmap is present, the Ruby parser (tomlparser-prom-customconfig.rb) skipped substituting the $AZMON_DS_PROM_* placeholders in telegraf.conf. Telegraf 1.37.x silently ignored the unresolved placeholders, but Telegraf 1.38.0+ defaults to strict env var handling and fails with 'invalid TOML syntax' on the undefined vars.

Add substituteDsDefaultsInTelegrafConf(), which replaces the placeholders with default values (empty arrays, 1m interval) directly in telegraf.conf. It is called from all fallback paths: no configmap, nil configmap settings, typecheck failure, and parse exceptions.
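
For illustration only, the default substitution described above could look roughly like the sketch below. The placeholder names in `DS_PROM_DEFAULTS` and the helper name `substitute_ds_defaults` are hypothetical (the description only says the placeholders match `$AZMON_DS_PROM_*`), and the real function in tomlparser-prom-customconfig.rb may be structured differently.

```ruby
# Hypothetical placeholder names; the PR only states that they match
# $AZMON_DS_PROM_* and that the defaults are empty arrays and a 1m interval.
DS_PROM_DEFAULTS = {
  '$AZMON_DS_PROM_INTERVAL' => '1m',
  '$AZMON_DS_PROM_FIELDPASS' => '[]',
  '$AZMON_DS_PROM_FIELDDROP' => '[]',
  '$AZMON_DS_PROM_URLS' => '[]'
}.freeze

# Returns a copy of +text+ with every known placeholder replaced by its default.
def substitute_ds_defaults(text, defaults = DS_PROM_DEFAULTS)
  # Replace longer names first so that a placeholder which is a prefix of
  # another one cannot clobber the longer name.
  ordered = defaults.keys.sort_by { |name| -name.length }
  ordered.reduce(text) do |acc, name|
    # Block form of gsub inserts the replacement literally.
    acc.gsub(name) { defaults[name] }
  end
end
```

A caller on a fallback path would read telegraf.conf, pass the text through `substitute_ds_defaults`, and write the result back, so that no `$AZMON_DS_PROM_*` token is left for Telegraf to reject.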

Tested multiple Telegraf scenarios by updating the configmap; all work as expected and are in line with the current prod image.

This PR also contains a fix for an intermittent zlib gem install issue on arm64.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@suyadav1 suyadav1 requested a review from a team as a code owner April 10, 2026 21:28
suyadav1 and others added 8 commits April 13, 2026 18:37
When no OSM configmap is mounted (the common case), the
$AZMON_TELEGRAF_OSM_PROM_PLUGINS placeholder was left as a raw string
in the telegraf sidecar and replicaset config files. Telegraf 1.38.0+
uses strict env var handling by default and fails to parse config files
containing unresolved $ placeholders.

This is the same class of bug fixed in PR #1640 for $AZMON_DS_PROM_*
placeholders in the daemonset telegraf config.

Add substituteOsmDefaultsInTelegrafConf() to replace the OSM placeholder
with an empty string on all fallback paths:
- When AZMON_OSM_CFG_SCHEMA_VERSION is not set (no configmap mounted)
- When parseConfigMap returns nil (configmap exists but parse failed)
- When schema version is unsupported
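
A minimal sketch of the three fallback paths listed above, assuming a single supported schema version. `SUPPORTED_OSM_SCHEMA`, the helper name `resolve_osm_placeholder`, and its keyword arguments are hypothetical; only the placeholder name comes from this commit message.

```ruby
OSM_PLACEHOLDER = '$AZMON_TELEGRAF_OSM_PROM_PLUGINS'
SUPPORTED_OSM_SCHEMA = 'v1' # assumption, not confirmed by the PR

# Replaces the OSM placeholder in +conf_text+. On every fallback path the
# placeholder becomes an empty string; otherwise the caller's block renders
# the plugin text from the parsed configmap.
def resolve_osm_placeholder(conf_text, schema_version:, parsed_config:)
  fallback =
    schema_version.nil? || schema_version.strip.empty? ||   # no configmap mounted
    !schema_version.strip.casecmp?(SUPPORTED_OSM_SCHEMA) || # unsupported schema version
    parsed_config.nil?                                      # configmap present, parse failed

  replacement = fallback ? '' : yield(parsed_config)
  conf_text.gsub(OSM_PLACEHOLDER) { replacement }
end
```

The point of the sketch is that the placeholder is rewritten on every path, so the resulting config never contains the raw `$AZMON_TELEGRAF_OSM_PROM_PLUGINS` token.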

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comprehensive skill covering 8 test phases across all container types
(daemonset, replicaset, sidecar, windows):

- Pre-flight: version, pod health, process checks
- Config parsing: placeholder substitution, TOML test mode, error logs
- Default features: kubelet, disk, diskio, net metrics, loopback filter
- Custom features: pod scraping, scrape scope, namespace filtering,
  field filtering, TLS, label/field selectors, custom URLs
- Integrations: OSM, NPM, subnet IP, process metrics (App Insights)
- Data flow: socket_writer ports (25226/25228/25229), fluent-bit/fluentd
- Deprecation tracking: fieldpass->fieldinclude, strict env handling
- Windows: separate version, response_timeout, reference app

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…map scenarios

Key improvements:
- Compare test image data against production baseline in Log Analytics
  instead of checking internal data flow directly
- Only investigate telegraf internals (processes, configs, ports) when
  data comparison shows a mismatch
- Add 7 configmap scenarios testing different telegraf features:
  1. Default (no custom prom)
  2. Pod-level prometheus scraping (monitor_kubernetes_pods)
  3. Namespace-scoped scraping (monitor_kubernetes_pods_namespaces)
  4. Custom URLs (daemonset node-level)
  5. Field filtering (fieldpass/fielddrop)
  6. Label and field selectors
  7. Process metrics (procstat -> application_insights)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… failures

gem install fluentd intermittently fails during Docker builds due to
network timeouts or gem server issues. Since setup.sh has no 'set -e',
the failure is silent and the build continues without /usr/bin/fluentd.
This causes the Dockerfile COPY to fail later with:
  /usr/bin/fluentd: not found

Add retry (3 attempts with 10s delay) for fluentd, gyoku/iso8601/bigdecimal,
and tomlrb gem installs. Also add a hard exit if fluentd is not available
after retries to fail fast instead of at the Dockerfile COPY stage.
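
The actual change lives in setup.sh, which is a shell script. Purely to illustrate the retry-then-fail-fast pattern described above, here is a sketch in Ruby; `with_retries` and its parameters are hypothetical names, not part of the PR.

```ruby
# Runs the block up to +attempts+ times, pausing +delay+ seconds between
# failed attempts. Returns true as soon as the block returns a truthy value,
# false if every attempt fails. +sleeper+ is injectable so the pause can be
# stubbed out in tests.
def with_retries(attempts: 3, delay: 10, sleeper: ->(seconds) { sleep(seconds) })
  attempts.times do |index|
    return true if yield(index + 1)
    sleeper.call(delay) if index < attempts - 1 # no pause after the last try
  end
  false
end
```

A caller could wrap the install as `with_retries { system('gem', 'install', 'fluentd') }` and call `abort` when it returns false, which surfaces the failure at the install step rather than at the later Dockerfile COPY.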

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… build

The zlib gem (3.2.3, installed to fix CVE-2026-27820) requires zlib-devel
for its native extension build. On amd64, zlib-devel is installed as part
of the ruby-build dependencies. On arm64, ruby is installed from mariner
packages and zlib-devel was missing, causing:
  ERROR: Failed to build gem native extension.
  checking for deflateReset(NULL) in -lz... no

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… configs

The substituteDsDefaultsInTelegrafConf function only handled the daemonset
telegraf.conf, leaving raw $AZMON_TELEGRAF_CUSTOM_PROM_* placeholders in
telegraf-prom-side-car.conf and telegraf-rs.conf when no configmap is mounted.

Telegraf 1.38.0+ uses strict env var handling and fails on undefined env vars,
so these unsubstituted placeholders would cause telegraf to fail to start in
the sidecar container and (if ever enabled) in the RS container.

Extended substituteDsDefaultsInTelegrafConf to handle all three config files:
- telegraf.conf (DS ama-logs) - existing behavior
- telegraf-prom-side-car.conf (DS sidecar) - new
- telegraf-rs.conf (RS) - new

Also added substituteDsDefaultsInTelegrafConf calls to the typecheck-failed
and exception handlers for both RS and sidecar branches.
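
As a rough sketch of handling all three config files in one pass, the helper below rewrites whichever files exist and reports any `$AZMON_*` tokens that remain. The function names, the token pattern, and the idea of returning a report are assumptions for illustration; the PR does not describe the implementation at this level.

```ruby
# Matches tokens such as $AZMON_TELEGRAF_CUSTOM_PROM_FOO (pattern is an assumption).
PLACEHOLDER_PATTERN = /\$AZMON_[A-Z0-9_]+/

# Lists the distinct placeholder tokens still present in +text+.
def unresolved_placeholders(text)
  text.scan(PLACEHOLDER_PATTERN).uniq
end

# For each existing file in +paths+, replaces placeholders found in
# +defaults+ and returns { path => [tokens still unresolved] }.
# Files that do not exist are skipped, since not every container type
# ships every config file.
def apply_defaults_to_files(paths, defaults)
  report = {}
  paths.each do |path|
    next unless File.file?(path)

    text = File.read(path)
    # Whole-token matching: tokens without a default are left untouched.
    updated = text.gsub(PLACEHOLDER_PATTERN) { |token| defaults.fetch(token, token) }
    File.write(path, updated) unless updated == text
    report[path] = unresolved_placeholders(updated)
  end
  report
end
```

Called with the paths of telegraf.conf, telegraf-prom-side-car.conf, and telegraf-rs.conf, a non-empty list in the report would flag a placeholder that no default covers before Telegraf ever tries to parse the file.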

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@suyadav1
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

suyadav1 and others added 2 commits April 14, 2026 22:57
…bled

When SIDECAR_SCRAPING_ENABLED=true, two issues left raw $AZMON_* placeholders
in telegraf-rs.conf:

1. tomlparser-osm-config.rb skipped setting @tgfConfigFile for RS when sidecar
   scraping was enabled, so the OSM placeholder was never substituted.
   Fix: Remove the sidecarScrapingEnabled check from the RS path — the OSM
   parser should always set up the RS config file for placeholder substitution.

2. tomlparser-prom-customconfig.rb skipped the pod-monitoring placeholder
   substitution block (MONITOR_PODS, SCRAPE_SCOPE, LABEL_SELECTOR,
   FIELD_SELECTOR, PLUGINS_WITH_NAMESPACE_FILTER) when sidecar scraping was
   enabled. This left 5 raw placeholders in telegraf-rs.conf.
   Fix: Add an else branch that substitutes these placeholders with defaults
   when sidecar scraping is enabled.

These unsubstituted placeholders cause Telegraf 1.38.0+ to fail with strict
env var handling. RS telegraf IS actively running on clusters that have
kubernetes_services or urls configured in the configmap, even when sidecar
scraping is enabled.
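
The second fix can be pictured as the sketch below: the five pod-monitoring placeholders are always substituted, with configured values when sidecar scraping is disabled and with defaults when it is enabled. The full placeholder names (prefix plus the suffixes listed above), the default values, and the helper name `render_pod_monitoring` are all assumptions made for illustration.

```ruby
# Assumed full names and assumed default values; the commit message gives
# only the suffixes and says "defaults" without listing them.
POD_MONITORING_DEFAULTS = {
  '$AZMON_TELEGRAF_CUSTOM_PROM_MONITOR_PODS' => 'false',
  '$AZMON_TELEGRAF_CUSTOM_PROM_SCRAPE_SCOPE' => 'cluster',
  '$AZMON_TELEGRAF_CUSTOM_PROM_LABEL_SELECTOR' => '',
  '$AZMON_TELEGRAF_CUSTOM_PROM_FIELD_SELECTOR' => '',
  '$AZMON_TELEGRAF_CUSTOM_PROM_PLUGINS_WITH_NAMESPACE_FILTER' => ''
}.freeze

POD_MONITORING_PATTERN = /\$AZMON_TELEGRAF_CUSTOM_PROM_[A-Z_]+/

# Substitutes the pod-monitoring placeholders in +conf_text+.
# With sidecar scraping enabled, the RS config gets defaults only (the new
# else branch); otherwise configured values override the defaults.
def render_pod_monitoring(conf_text, sidecar_scraping_enabled:, configured: {})
  values =
    if sidecar_scraping_enabled
      POD_MONITORING_DEFAULTS
    else
      POD_MONITORING_DEFAULTS.merge(configured)
    end
  # Tokens outside the known five are left as they are.
  conf_text.gsub(POD_MONITORING_PATTERN) { |token| values.fetch(token, token) }
end
```

Either way, none of the five tokens survives in the rendered text, which is the property Telegraf's strict env var handling requires.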

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@suyadav1 suyadav1 closed this Apr 15, 2026