feat(connectivity_check): add automated monitoring with direct Datadog integration#79

Merged
sai-praveen-os merged 39 commits into main from
sai/PLATEXP-11106
Feb 12, 2026

sai-praveen-os commented Jan 19, 2026

JIRA: PLATEXP-11106

Add Direct Datadog Monitoring to connectivity_check Module

Problem

The connectivity_check module currently supports only on-demand connectivity testing via script invocation. We need automated, scheduled monitoring with near real-time metrics for proactive alerting. CloudWatch-based approaches introduce 15-20 minute delays, which is too slow for critical services.

Solution

Extend the module with optional monitoring capabilities using direct Datadog integration via @racker/janus-core. This provides metrics in seconds rather than minutes, enabling fast alerting on connectivity failures.

Changes

lambda/index.ts:

  • Added direct Datadog integration using @racker/janus-core stats library
  • Publishes metrics with the connectivity. prefix
  • Metrics: status (gauge), latency (timing/distribution), success.count (counter), error.count (counter)
  • Comprehensive tagging: env, endpoint, host, protocol, critical
  • Additional custom tags via METRIC_TAGS environment variable
  • Metrics published via HTTP directly to Datadog (no CloudWatch intermediary)
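
The metric set above could be emitted along the following lines. This is a minimal sketch, not the module's actual handler: the StatsClient interface, the CheckResult shape, and the publishResult helper are hypothetical stand-ins. Only stats.gauge(), stats.timing(), and stats.close() are named in this PR; the increment method, the host:port format of the endpoint tag, and emitting latency for failed checks are assumptions.

```typescript
// Hypothetical stand-in for the stats client; the real janus-core API may differ.
interface StatsClient {
  gauge(name: string, value: number, tags: string[]): void;
  timing(name: string, ms: number, tags: string[]): void;
  increment(name: string, tags: string[]): void; // assumed counter method
}

// Assumed shape of one connectivity check result.
interface CheckResult {
  host: string;
  port: number;
  protocol: string;
  critical: boolean;
  success: boolean;
  latencyMs: number;
}

// Emit the four connectivity.* metrics for a single check result.
function publishResult(stats: StatsClient, env: string, r: CheckResult): void {
  const tags = [
    `env:${env}`,
    `endpoint:${r.host}:${r.port}`, // assumed endpoint tag format
    `host:${r.host}`,
    `protocol:${r.protocol}`,
    `critical:${r.critical}`
  ];
  stats.gauge("connectivity.endpoint.status", r.success ? 1 : 0, tags);
  stats.timing("connectivity.endpoint.latency", r.latencyMs, tags);
  const counter = r.success
    ? "connectivity.endpoint.success.count"
    : "connectivity.endpoint.error.count";
  stats.increment(counter, tags);
}
```

Passing the client in as a parameter keeps the helper testable with a recording fake, without network access or a Datadog API key.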

lambda/package.json:

  • Added @racker/janus-core dependency for Datadog integration and logging

lambda/tsconfig.json:

  • TypeScript configuration for compilation to CommonJS

scripts/build-lambda.sh:

  • Automated build script for creating lambda.zip
  • Handles npm authentication via GitHub PAT
  • Compiles TypeScript to JavaScript
  • Packages dependencies

.github/workflows/build-lambda.yml:

  • GitHub Actions workflow to build lambda.zip on commits
  • Authenticates to GitHub Packages for private @racker/* dependencies
  • Commits built lambda.zip to repository for Terraform consumption

main.tf:

  • Uses pre-built lambda.zip (no build during Terraform apply)
  • Configurable timeout (default: 60s for multiple endpoint checks)
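  • No OpenTelemetry layer in the merged version (removed in a later commit after it caused Lambda timeouts alongside the janus-core stats client)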
  • Environment variables: DATADOG_API_KEY, ENVIRONMENT, METRIC_TAGS

variables.tf:

  • Added monitoring configuration: enable_monitoring, monitoring_schedule, monitoring_targets
  • Added Datadog configuration: datadog_api_key, environment, metric_tags
  • Configurable Lambda timeout and memory

monitoring.tf:

  • EventBridge rule for scheduled Lambda invocation (default: 1 minute)
  • Lambda permissions for EventBridge invocation
  • No CloudWatch alarms (alerting handled in Datadog)

Usage Examples

Basic Monitoring Setup

module "connectivity_monitor" {
  source = "git@github.com:RSS-Engineering/terraform//modules/connectivity_check?ref=<commit>"
  
  function_name      = "connectivity-check-primary-dev"
  subnet_ids         = ["subnet-xxx"]
  security_group_ids = ["sg-xxx"]
  
  enable_monitoring   = true
  monitoring_schedule = "rate(1 minute)"
  
  monitoring_targets = [
    {
      host     = "identity.api.rackspacecloud.com"
      port     = 443
      protocol = "https"
      path     = "/v2.0"
      critical = true
    },
    {
      host     = "api.example.com"
      port     = 443
      protocol = "https"
      critical = false
    }
  ]
  
  datadog_api_key = var.datadog_api_key
  environment     = "dev"
  metric_tags     = "service:janus,team:platform"
}

Monitoring with Custom Timeout

module "connectivity_monitor" {
  source = "git@github.com:RSS-Engineering/terraform//modules/connectivity_check?ref=<commit>"
  
  # ... other config ...
  
  timeout = 120  # For checking many endpoints
}

Metrics Published to Datadog

All metrics use the connectivity. prefix:

| Metric | Type | Description | Tags |
| --- | --- | --- | --- |
| connectivity.endpoint.status | gauge | 1 = up, 0 = down | env, endpoint, host, protocol, critical, [custom tags] |
| connectivity.endpoint.latency | timing (distribution) | Response time in milliseconds (creates .avg, .min, .max, .95percentile, etc.) | env, endpoint, host, protocol, critical, [custom tags] |
| connectivity.endpoint.success.count | counter | Successful checks | env, endpoint, host, protocol, critical, [custom tags] |
| connectivity.endpoint.error.count | counter | Failed checks | env, endpoint, host, protocol, critical, [custom tags] |

Note: The latency metric uses stats.timing(), which creates a distribution metric with automatic aggregations (.avg, .min, .max, .median, .95percentile, etc.). Custom tags from the metric_tags variable are appended to all metrics.
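
As an illustration of how a comma-separated metric_tags value such as "service:janus,team:platform" might be appended to the standard tags, here is a minimal sketch. The helper names parseMetricTags and mergeTags are hypothetical, and the whitespace trimming and duplicate handling are assumptions rather than the Lambda's confirmed behavior.

```typescript
// Split a comma-separated "key:value" string into clean tags.
// Empty input and stray commas yield no tags rather than empty strings.
function parseMetricTags(raw: string | undefined): string[] {
  if (!raw) return [];
  return raw
    .split(",")
    .map((t) => t.trim())
    .filter((t) => t.length > 0);
}

// Standard tags first, then custom tags, skipping exact duplicates.
function mergeTags(standard: string[], custom: string[]): string[] {
  const merged = standard.slice();
  for (const tag of custom) {
    if (merged.indexOf(tag) === -1) merged.push(tag);
  }
  return merged;
}

// Example: the value would normally come from the METRIC_TAGS environment variable.
const exampleTags = mergeTags(
  ["env:dev", "host:api.example.com"],
  parseMetricTags("service:janus, team:platform")
);
// exampleTags is ["env:dev", "host:api.example.com", "service:janus", "team:platform"]
```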

Key Features

  • Near real-time metrics: Published to Datadog in seconds (vs 15-20 minutes with CloudWatch)
  • Direct integration: No CloudWatch intermediary, reducing complexity and cost
  • Flexible scheduling: Configurable check frequency (default: 1 minute)
  • Comprehensive tagging: Easy filtering and aggregation in Datadog
  • Custom metric tags: Add service, team, or other tags via metric_tags variable
  • Critical flag: Enables different alert severities in Datadog monitors
  • Pre-built Lambda: GitHub Actions builds lambda.zip, avoiding npm auth issues during Terraform apply
  • Backward compatible: All monitoring features are optional

Implementation Notes

Lambda Module System: The Lambda uses CommonJS (require()) rather than ES modules because @racker/janus-core@12.4.0 is CommonJS-only. An ES module build was attempted but failed at runtime with directory import errors. This is an internal implementation detail that does not affect module consumers.

Latency Metrics: The connectivity.endpoint.latency metric uses stats.timing(), which creates a Datadog distribution metric. This automatically generates aggregated metrics such as .avg, .min, .max, .median, and .95percentile. Use connectivity.endpoint.latency.avg in Datadog queries.

Build Process

  1. Developer commits changes to lambda/ directory
  2. GitHub Actions workflow triggers on push
  3. Workflow authenticates to GitHub Packages using JANUS_GITHUB_PAT
  4. TypeScript compiled to JavaScript
  5. Dependencies installed and packaged
  6. lambda.zip committed back to repository
  7. Terraform uses pre-built lambda.zip (no build during apply)

Testing

Tested in janus-infrastructure dev environment:

  • Lambda executes successfully every minute
  • Metrics published to Datadog within seconds
  • All endpoints checked correctly (TCP and HTTPS)
  • Tags applied correctly for filtering
  • Dashboard created successfully

Breaking Changes

None - all monitoring features are opt-in via enable_monitoring flag.

…trics and alarms

- Add monitoring.tf with EventBridge scheduling and CloudWatch alarms
- Extend Lambda handler to publish EndpointConnectivity and EndpointLatency metrics
- Add @aws-sdk/client-cloudwatch dependency to package.json
- Add monitoring configuration variables (enable_monitoring, monitoring_schedule, monitoring_targets, alarm_sns_topic_arns)
- Create per-critical-endpoint alarms, aggregate alarm, and Lambda error alarm
- All monitoring features are optional and backward compatible

Task: Enable proactive monitoring for Janus external dependencies after identity service outage incident
sai-praveen-os and others added 14 commits January 22, 2026 15:11
- Include critical field directly in testTcp() and testHttp() results
- Improve alarm granularity: 1-min schedule, 3 evaluation periods (3-min alert vs 10-min)
- Hard-code CloudWatch namespace to 'connectivity' for Datadog consistency
- Remove duplicate variable declarations from monitoring.tf (42 lines)
- Remove unused failOnConnectivityLoss logic
- Make CloudWatch alarms optional when alarm_sns_topic_arns is empty
- Support Datadog-only monitoring (metrics without CloudWatch alarms)

Changes reduce code by 56 lines while improving responsiveness and flexibility.

JIRA: PLATEXP-11106
…g integration

Replace CloudWatch custom metrics with direct Datadog integration using
janus-core library and OpenTelemetry. This addresses performance and cost
concerns raised in PR review.

Changes:
- Lambda handler now uses @janus.team/janus-core for metrics/logging
- Metrics reach Datadog in seconds (not 15-20 minutes)
- Cost reduced from $0.30/metric (CloudWatch) to $0.001/metric (Datadog)
- Added OpenTelemetry layer and required environment variables
- Removed all CloudWatch alarm resources (117 lines)
- Simplified monitoring.tf to only handle EventBridge scheduling
- Changed package.json to commonjs (required for janus-core)
- Fixed npm_requirements flag to true (now has dependencies)

Metrics published:
- connectivity.endpoint.status (gauge: 1=success, 0=failure)
- connectivity.endpoint.latency (gauge: milliseconds)
- connectivity.endpoint.success.count (counter)
- connectivity.endpoint.error.count (counter)

PLATEXP-11106
…da_function_tags

The terraform-aws-modules/lambda module does not expose lambda_function_tags
as an output. Use inline tags matching the Lambda function tags instead.

PLATEXP-11106
Switch to pre-built Lambda package to avoid npm authentication issues during
terraform apply. This allows the module to work across all consuming repos
without requiring npm credentials.

Changes:
- Renamed handler.ts to index.ts (standard convention)
- Updated main.tf to use pre-built lambda.zip instead of building from source
- Created build-lambda.sh script to build Lambda package with dependencies
- Added GitHub Actions workflow to auto-rebuild lambda.zip on code changes
- Updated .gitignore to allow committing lambda.zip
- Added README with usage and build instructions

The lambda.zip will be built by GitHub Actions when Lambda code changes,
bundling @janus.team/janus-core and other dependencies.

PLATEXP-11106
Change NPM_TOKEN to JANUS_NPM_TOKEN to match the secret name configured
in the repository.

PLATEXP-11106
…s.team scope

PLATEXP-11106

- Configure setup-node action to use GitHub Packages registry
- Add .npmrc to configure @janus.team scope registry
- Use NODE_AUTH_TOKEN environment variable for authentication
- This fixes the 404 error when installing @janus.team/janus-core
PLATEXP-11106

- Changed package from @janus.team/janus-core to @racker/janus-core
- Updated GitHub Actions workflow to use @racker scope
- Updated .npmrc to configure @racker registry
- Updated import statements in Lambda handler
- Moved NODE_AUTH_TOKEN to Build Lambda package step
…uild

- Add TypeScript compilation step using npx tsc
- Copy tsconfig.json and .npmrc to build directory
- Install all dependencies (including devDependencies) for compilation
- Remove TypeScript source files after compilation
- Prune dev dependencies before packaging
- Package index.js instead of index.ts in lambda.zip

This fixes the Lambda runtime error 'Cannot find module index' by ensuring
the Lambda package contains compiled JavaScript instead of TypeScript source.

Ref: PLATEXP-11106
- Update .npmrc to use JANUS_GITHUB_PAT environment variable
- Update GitHub Actions workflow to set JANUS_GITHUB_PAT
- Ensures consistent token naming across the build process

Ref: PLATEXP-11106
Increase timeout from 10s to 60s to allow sufficient time for:
- Testing 6 endpoints with 5-second timeouts each
- DNS resolution and network latency
- Datadog metric publishing

This ensures connectivity checks complete successfully and metrics
reach Datadog for near real-time alerting on critical endpoints.
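
The arithmetic behind this change can be checked with a short sketch. The function names and the 5-second overhead allowance (for DNS, network latency, and metric publishing) are illustrative assumptions, not values taken from the module.

```typescript
// Worst-case wall-clock time when checks run one after another
// and every check runs into its timeout.
function sequentialWorstCaseMs(endpoints: number, perCheckTimeoutMs: number): number {
  return endpoints * perCheckTimeoutMs;
}

// True when the Lambda timeout covers the worst case plus an overhead allowance.
function fitsInTimeout(
  lambdaTimeoutMs: number,
  endpoints: number,
  perCheckTimeoutMs: number,
  overheadMs: number
): boolean {
  return sequentialWorstCaseMs(endpoints, perCheckTimeoutMs) + overheadMs <= lambdaTimeoutMs;
}

// 6 endpoints x 5 s = 30 s worst case, which exceeds the old 10 s limit
// but fits within the new 60 s limit.
const worstCaseMs = sequentialWorstCaseMs(6, 5000);
const fitsOldLimit = fitsInTimeout(10000, 6, 5000, 5000);
const fitsNewLimit = fitsInTimeout(60000, 6, 5000, 5000);
```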

Ref: PLATEXP-11106
sai-praveen-os changed the title from "feat(connectivity_check): add automated monitoring with CloudWatch metrics and alarms" to "feat(connectivity_check): add automated monitoring with direct Datadog integration" Feb 3, 2026
sai-praveen-os and others added 2 commits February 4, 2026 16:32
…execution

PLATEXP-11106

Changes:
- Modified handler to use Promise.all() for parallel execution
- All 6 endpoints now checked simultaneously instead of sequentially
- Reduces total execution time from ~30s to ~5s
- Eliminates timeout cascade where slow endpoints affect subsequent checks
- Improves reliability by preventing sequential timeout failures

Benefits:
- Faster detection of connectivity issues
- More accurate results (no timeout cascades)
- Lower Lambda execution costs
- Better user experience in dashboard
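
The parallel pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's handler: Target, Outcome, Probe, withTimeout, checkOne, and checkAll are hypothetical names, and the probe is injected so the sketch does not depend on the real testTcp()/testHttp() implementations.

```typescript
interface Target { host: string; port: number; }
interface Outcome { host: string; success: boolean; latencyMs: number; }
// A probe resolves on success and rejects on failure (assumed contract).
type Probe = (t: Target) => Promise<void>;

// Reject if `p` has not settled within `ms` milliseconds.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timed out after " + ms + " ms")), ms);
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); }
    );
  });
}

// Run one probe and turn any failure into a result instead of a rejection,
// so a single bad endpoint cannot make Promise.all reject.
async function checkOne(t: Target, probe: Probe, timeoutMs: number): Promise<Outcome> {
  const started = Date.now();
  try {
    await withTimeout(probe(t), timeoutMs);
    return { host: t.host, success: true, latencyMs: Date.now() - started };
  } catch {
    return { host: t.host, success: false, latencyMs: Date.now() - started };
  }
}

// Start every check at once; total time is roughly that of the slowest check,
// and results come back in the same order as the targets.
function checkAll(targets: Target[], probe: Probe, timeoutMs: number): Promise<Outcome[]> {
  return Promise.all(targets.map((t) => checkOne(t, probe, timeoutMs)));
}
```

Because each check carries its own timeout and never rejects, a slow endpoint costs at most one timeout period overall instead of delaying every check queued behind it.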
sai-praveen-os and others added 6 commits February 6, 2026 14:34
Switch lambda package from CommonJS to ES modules to be consistent with
scripts directory and avoid mixing module systems. Updated package.json
type field and tsconfig.json module target to ES2022.

Addresses PR #79 review comment from smayberry about avoiding mixed
import/require statements.

Task: PLATEXP-11106
- Use stats.timing() instead of stats.gauge() for latency metrics
- Add metric_tags variable for flexible service tagging
- Replace JANUS-specific variables with generic ENVIRONMENT and METRIC_TAGS
- Update EventBridge rule to use state parameter instead of conditional creation
- Rename janus_environment variable to environment for broader reusability

Addresses all review comments from smayberry on Feb 4, 2026.

Task: PLATEXP-11106
PLATEXP-11106

The Lambda was failing with 'Cannot use import statement outside a module'
because package.json with type: module was not included in the zip file.
Updated build script to include package.json alongside index.js.
ES modules don't support directory imports. Must explicitly specify
index.js when importing from @racker/janus-core subdirectories.

Fixes: Directory import '/var/task/node_modules/@racker/janus-core/lib/stats'
is not supported resolving ES modules

JIRA: PLATEXP-11106
ES modules don't work well with @racker/janus-core subdirectory imports.
Reverting to CommonJS which works reliably with the package.

JIRA: PLATEXP-11106
The OpenTelemetry layer was causing Lambda timeouts when combined with
janus-core stats library. Both were trying to send telemetry data,
causing conflicts and blocking stats.close().

Since janus-core stats already sends metrics directly to Datadog via
HTTP, the OTEL layer is unnecessary for this connectivity check Lambda.

Removed:
- OpenTelemetry Lambda layer
- AWS_LAMBDA_EXEC_WRAPPER environment variable

This matches the approach used in other simple monitoring Lambdas and
resolves the intermittent timeout issues.

Task: PLATEXP-11106
sai-praveen-os and others added 3 commits February 12, 2026 19:12
…nd defaults

PLATEXP-11106

- Fixed variable name from janus_environment to environment
- Added metric_tags variable to inputs table
- Updated timeout default from 10 to 60 seconds
- Added metric_tags parameter to usage example
sai-praveen-os merged commit ea30478 into main Feb 12, 2026
sai-praveen-os deleted the sai/PLATEXP-11106 branch February 12, 2026 14:52