Fix telegraf startup failure when no prometheus configmap is mounted #1640
Closed
When no container-azm-ms-agentconfig configmap is present, the ruby parser (tomlparser-prom-customconfig.rb) skipped substituting $AZMON_DS_PROM_* placeholders in telegraf.conf. With Telegraf 1.37.x this was silently ignored, but Telegraf 1.38.0+ defaults to strict env var handling and fails with 'invalid TOML syntax' on the undefined vars.

Add substituteDsDefaultsInTelegrafConf() that replaces placeholders with default values (empty arrays, 1m interval) directly in telegraf.conf. Called from all fallback paths: no configmap, nil configmap settings, typecheck failure, and parse exceptions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
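The fallback substitution described in this commit can be sketched roughly as below. This is a minimal illustration, not the code from the PR: the method name `substitute_ds_defaults`, the specific placeholder names under the `$AZMON_DS_PROM_*` prefix, and the sample config lines are all assumptions. Only the default values (empty arrays, a 1m interval) come from the commit message.

```ruby
# Hypothetical placeholder names; the commit message only gives the
# $AZMON_DS_PROM_* prefix and the defaults (empty arrays, 1m interval).
DS_PROM_DEFAULTS = {
  "$AZMON_DS_PROM_INTERVAL"  => "1m",
  "$AZMON_DS_PROM_URLS"      => "[]",
  "$AZMON_DS_PROM_FIELDPASS" => "[]",
}.freeze

# Returns a copy of conf_text with every known placeholder replaced by
# its default. Longer names are replaced first so that a placeholder
# which is a prefix of another one cannot clobber it.
def substitute_ds_defaults(conf_text, defaults = DS_PROM_DEFAULTS)
  defaults.keys.sort_by { |name| -name.length }.reduce(conf_text) do |text, name|
    # Block form of gsub: the replacement is inserted literally.
    text.gsub(name) { defaults[name] }
  end
end

# Assumed shape of the relevant telegraf.conf lines.
sample = <<~'CONF'
  interval = "$AZMON_DS_PROM_INTERVAL"
  urls = $AZMON_DS_PROM_URLS
  fieldpass = $AZMON_DS_PROM_FIELDPASS
CONF

puts substitute_ds_defaults(sample)
```

Working on the config text rather than on the environment means the result does not depend on which variables happen to be exported when Telegraf starts.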
When no OSM configmap is mounted (the common case), the $AZMON_TELEGRAF_OSM_PROM_PLUGINS placeholder was left as a raw string in the telegraf sidecar and replicaset config files. Telegraf 1.38.0+ uses strict env var handling by default and fails to parse config files containing unresolved $ placeholders. This is the same class of bug fixed in PR #1640 for $AZMON_DS_PROM_* placeholders in the daemonset telegraf config.

Add substituteOsmDefaultsInTelegrafConf() to replace the OSM placeholder with an empty string on all fallback paths:
- When AZMON_OSM_CFG_SCHEMA_VERSION is not set (no configmap mounted)
- When parseConfigMap returns nil (configmap exists but parse failed)
- When schema version is unsupported

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comprehensive skill covering 8 test phases across all container types (daemonset, replicaset, sidecar, windows):
- Pre-flight: version, pod health, process checks
- Config parsing: placeholder substitution, TOML test mode, error logs
- Default features: kubelet, disk, diskio, net metrics, loopback filter
- Custom features: pod scraping, scrape scope, namespace filtering, field filtering, TLS, label/field selectors, custom URLs
- Integrations: OSM, NPM, subnet IP, process metrics (App Insights)
- Data flow: socket_writer ports (25226/25228/25229), fluent-bit/fluentd
- Deprecation tracking: fieldpass->fieldinclude, strict env handling
- Windows: separate version, response_timeout, reference app

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…map scenarios

Key improvements:
- Compare test image data against production baseline in Log Analytics instead of checking internal data flow directly
- Only investigate telegraf internals (processes, configs, ports) when data comparison shows a mismatch
- Add 7 configmap scenarios testing different telegraf features:
  1. Default (no custom prom)
  2. Pod-level prometheus scraping (monitor_kubernetes_pods)
  3. Namespace-scoped scraping (monitor_kubernetes_pods_namespaces)
  4. Custom URLs (daemonset node-level)
  5. Field filtering (fieldpass/fielddrop)
  6. Label and field selectors
  7. Process metrics (procstat -> application_insights)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… failures

gem install fluentd intermittently fails during Docker builds due to network timeouts or gem server issues. Since setup.sh has no 'set -e', the failure is silent and the build continues without /usr/bin/fluentd. This causes the Dockerfile COPY to fail later with:

  /usr/bin/fluentd: not found

Add retry (3 attempts with 10s delay) for fluentd, gyoku/iso8601/bigdecimal, and tomlrb gem installs. Also add a hard exit if fluentd is not available after retries to fail fast instead of at the Dockerfile COPY stage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… build

The zlib gem (3.2.3, installed to fix CVE-2026-27820) requires zlib-devel for its native extension build. On amd64, zlib-devel is installed as part of the ruby-build dependencies. On arm64, ruby is installed from mariner packages and zlib-devel was missing, causing:

  ERROR: Failed to build gem native extension.
  checking for deflateReset(NULL) in -lz... no

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… configs

The substituteDsDefaultsInTelegrafConf function only handled the daemonset telegraf.conf, leaving raw $AZMON_TELEGRAF_CUSTOM_PROM_* placeholders in telegraf-prom-side-car.conf and telegraf-rs.conf when no configmap is mounted. Telegraf 1.38.0+ uses strict env var handling and fails on undefined env vars, so these unsubstituted placeholders would cause telegraf to fail to start in the sidecar container and (if ever enabled) in the RS container.

Extended substituteDsDefaultsInTelegrafConf to handle all three config files:
- telegraf.conf (DS ama-logs) - existing behavior
- telegraf-prom-side-car.conf (DS sidecar) - new
- telegraf-rs.conf (RS) - new

Also added substituteDsDefaultsInTelegrafConf calls to the typecheck-failed and exception handlers for both RS and sidecar branches.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
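Extending the substitution to several config files could look something like the sketch below. The three file names are taken from the commit message; the method name `substitute_defaults_in_files`, the directory argument, and the concrete placeholder names under the `$AZMON_TELEGRAF_CUSTOM_PROM_*` prefix are hypothetical and chosen only for illustration.

```ruby
require "tmpdir"

# File names from the commit message; where they live on disk is assumed.
CONF_FILES = %w[telegraf.conf telegraf-prom-side-car.conf telegraf-rs.conf].freeze

# Hypothetical placeholder names under the $AZMON_TELEGRAF_CUSTOM_PROM_* prefix.
CUSTOM_PROM_DEFAULTS = {
  "$AZMON_TELEGRAF_CUSTOM_PROM_INTERVAL" => "1m",
  "$AZMON_TELEGRAF_CUSTOM_PROM_URLS"     => "[]",
}.freeze

# Rewrites each config file found in dir, replacing known placeholders
# with defaults. Returns the names of the files that were changed.
def substitute_defaults_in_files(dir, defaults = CUSTOM_PROM_DEFAULTS, files = CONF_FILES)
  files.each_with_object([]) do |name, touched|
    path = File.join(dir, name)
    next unless File.file?(path) # a container may ship only some of the files

    original = File.read(path)
    updated = defaults.reduce(original) do |text, (placeholder, value)|
      text.gsub(placeholder) { value }
    end
    next if updated == original # nothing to substitute, leave the file alone

    File.write(path, updated)
    touched << name
  end
end

# Usage against a throwaway directory.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "telegraf-rs.conf"), "urls = $AZMON_TELEGRAF_CUSTOM_PROM_URLS\n")
  p substitute_defaults_in_files(dir)
  puts File.read(File.join(dir, "telegraf-rs.conf"))
end
```

Skipping missing files keeps one helper usable from the daemonset, sidecar, and replicaset branches without each caller having to know which files exist.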
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
…bled

When SIDECAR_SCRAPING_ENABLED=true, two issues left raw $AZMON_* placeholders in telegraf-rs.conf:

1. tomlparser-osm-config.rb skipped setting @tgfConfigFile for RS when sidecar scraping was enabled, so the OSM placeholder was never substituted.
   Fix: Remove the sidecarScrapingEnabled check from the RS path. The OSM parser should always set up the RS config file for placeholder substitution.
2. tomlparser-prom-customconfig.rb skipped the pod-monitoring placeholder substitution block (MONITOR_PODS, SCRAPE_SCOPE, LABEL_SELECTOR, FIELD_SELECTOR, PLUGINS_WITH_NAMESPACE_FILTER) when sidecar scraping was enabled. This left 5 raw placeholders in telegraf-rs.conf.
   Fix: Add an else branch that substitutes these placeholders with defaults when sidecar scraping is enabled.

These unsubstituted placeholders cause Telegraf 1.38.0+ to fail with strict env var handling. RS telegraf IS actively running on clusters that have kubernetes_services or urls configured in the configmap, even when sidecar scraping is enabled.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
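One way to catch leftover placeholders like these before Telegraf starts is to scan the rendered config for anything that still looks like `$AZMON_*`. The sketch below is an illustration under assumptions, not part of the PR: the method name `unresolved_placeholders`, the regular expression, and the sample placeholder names are all hypothetical.

```ruby
# Matches $AZMON_FOO and ${AZMON_FOO}; the exact placeholder grammar used
# in the real config files is an assumption.
PLACEHOLDER_PATTERN = /\$\{?AZMON_[A-Z0-9_]+\}?/

# Returns [line_number, placeholder] pairs for every placeholder that is
# still present in the config text. An empty result means none were found.
def unresolved_placeholders(conf_text)
  conf_text.each_line.with_index(1).flat_map do |line, lineno|
    line.scan(PLACEHOLDER_PATTERN).map { |name| [lineno, name] }
  end
end

# Hypothetical telegraf-rs.conf fragment with two leftover placeholders.
sample = <<~'CONF'
  interval = "1m"
  monitor_kubernetes_pods = $AZMON_EXAMPLE_MONITOR_PODS
  kubernetes_label_selector = "$AZMON_EXAMPLE_LABEL_SELECTOR"
CONF

unresolved_placeholders(sample).each do |lineno, name|
  puts "line #{lineno}: unresolved #{name}"
end
```

A check of this kind reports every leftover placeholder with its line number, rather than relying on Telegraf's parse error to reveal them.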
When no container-azm-ms-agentconfig configmap is present, the ruby parser (tomlparser-prom-customconfig.rb) skipped substituting $AZMON_DS_PROM_* placeholders in telegraf.conf. With Telegraf 1.37.x this was silently ignored, but Telegraf 1.38.0+ defaults to strict env var handling and fails with 'invalid TOML syntax' on the undefined vars.
Add substituteDsDefaultsInTelegrafConf() that replaces placeholders with default values (empty arrays, 1m interval) directly in telegraf.conf. Called from all fallback paths: no configmap, nil configmap settings, typecheck failure, and parse exceptions.
Tested multiple Telegraf scenarios by updating the configmap; all work as expected and are in line with the current prod image.
This PR also contains a fix for an intermittent zlib gem install issue on arm64.