feat(connectivity_check): add automated monitoring with direct Datadog integration #79
Merged
sai-praveen-os merged 39 commits into main on Feb 12, 2026
…trics and alarms
- Add monitoring.tf with EventBridge scheduling and CloudWatch alarms
- Extend Lambda handler to publish EndpointConnectivity and EndpointLatency metrics
- Add @aws-sdk/client-cloudwatch dependency to package.json
- Add monitoring configuration variables (enable_monitoring, monitoring_schedule, monitoring_targets, alarm_sns_topic_arns)
- Create per-critical-endpoint alarms, aggregate alarm, and Lambda error alarm
- All monitoring features are optional and backward compatible
Task: Enable proactive monitoring for Janus external dependencies after identity service outage incident
smayberry reviewed Jan 20, 2026
- Include critical field directly in testTcp() and testHttp() results
- Improve alarm granularity: 1-min schedule, 3 evaluation periods (3-min alert vs 10-min)
- Hard-code CloudWatch namespace to 'connectivity' for Datadog consistency
- Remove duplicate variable declarations from monitoring.tf (42 lines)
- Remove unused failOnConnectivityLoss logic
- Make CloudWatch alarms optional when alarm_sns_topic_arns is empty
- Support Datadog-only monitoring (metrics without CloudWatch alarms)
Changes reduce code by 56 lines while improving responsiveness and flexibility.
JIRA: PLATEXP-11106
…g integration
Replace CloudWatch custom metrics with direct Datadog integration using janus-core library and OpenTelemetry. This addresses performance and cost concerns raised in PR review.
Changes:
- Lambda handler now uses @janus.team/janus-core for metrics/logging
- Metrics reach Datadog in seconds (not 15-20 minutes)
- Cost reduced from $0.30/metric (CloudWatch) to $0.001/metric (Datadog)
- Added OpenTelemetry layer and required environment variables
- Removed all CloudWatch alarm resources (117 lines)
- Simplified monitoring.tf to only handle EventBridge scheduling
- Changed package.json to commonjs (required for janus-core)
- Fixed npm_requirements flag to true (now has dependencies)
Metrics published:
- connectivity.endpoint.status (gauge: 1=success, 0=failure)
- connectivity.endpoint.latency (gauge: milliseconds)
- connectivity.endpoint.success.count (counter)
- connectivity.endpoint.error.count (counter)
PLATEXP-11106
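As a rough illustration of the four metrics listed in this commit, the sketch below emits them for a single check result. The `StatsLike` interface, the `CheckResult` shape, and `RecordingStats` are hypothetical stand-ins, not the janus-core API; the tag names (env, endpoint, host, protocol, critical) are the ones listed in the PR description. Latency is shown as a gauge, as in this commit; a later commit switches it to stats.timing().

```typescript
// Hypothetical minimal stats client; the real janus-core API may differ.
interface StatsLike {
  gauge(name: string, value: number, tags: string[]): void;
  increment(name: string, tags: string[]): void;
}

// Assumed shape of one endpoint check result (illustrative only).
interface CheckResult {
  endpoint: string;
  host: string;
  protocol: "tcp" | "http";
  critical: boolean;
  success: boolean;
  latencyMs: number;
}

// Emit status, latency, and exactly one of the success/error counters.
function publishResult(stats: StatsLike, env: string, result: CheckResult): void {
  const tags = [
    `env:${env}`,
    `endpoint:${result.endpoint}`,
    `host:${result.host}`,
    `protocol:${result.protocol}`,
    `critical:${result.critical}`,
  ];
  stats.gauge("connectivity.endpoint.status", result.success ? 1 : 0, tags);
  stats.gauge("connectivity.endpoint.latency", result.latencyMs, tags);
  stats.increment(
    result.success
      ? "connectivity.endpoint.success.count"
      : "connectivity.endpoint.error.count",
    tags,
  );
}

// In-memory client that records what would be sent, useful for local checks.
class RecordingStats implements StatsLike {
  readonly sent: string[] = [];
  gauge(name: string, value: number, tags: string[]): void {
    this.sent.push(`gauge ${name}=${value} [${tags.join(",")}]`);
  }
  increment(name: string, tags: string[]): void {
    this.sent.push(`count ${name}+1 [${tags.join(",")}]`);
  }
}
```

Keeping the client behind a small interface like this makes it possible to exercise the handler logic without network access; whether the real handler is structured this way is not stated in the PR.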
…da_function_tags The terraform-aws-modules/lambda module does not expose lambda_function_tags as an output. Use inline tags matching the Lambda function tags instead. PLATEXP-11106
Switch to pre-built Lambda package to avoid npm authentication issues during terraform apply. This allows the module to work across all consuming repos without requiring npm credentials.
Changes:
- Renamed handler.ts to index.ts (standard convention)
- Updated main.tf to use pre-built lambda.zip instead of building from source
- Created build-lambda.sh script to build Lambda package with dependencies
- Added GitHub Actions workflow to auto-rebuild lambda.zip on code changes
- Updated .gitignore to allow committing lambda.zip
- Added README with usage and build instructions
The lambda.zip will be built by GitHub Actions when Lambda code changes, bundling @janus.team/janus-core and other dependencies.
PLATEXP-11106
Change NPM_TOKEN to JANUS_NPM_TOKEN to match the secret name configured in the repository. PLATEXP-11106
…s.team scope
PLATEXP-11106
- Configure setup-node action to use GitHub Packages registry
- Add .npmrc to configure @janus.team scope registry
- Use NODE_AUTH_TOKEN environment variable for authentication
- This fixes the 404 error when installing @janus.team/janus-core
…uild
- Add TypeScript compilation step using npx tsc
- Copy tsconfig.json and .npmrc to build directory
- Install all dependencies (including devDependencies) for compilation
- Remove TypeScript source files after compilation
- Prune dev dependencies before packaging
- Package index.js instead of index.ts in lambda.zip
This fixes the Lambda runtime error 'Cannot find module index' by ensuring the Lambda package contains compiled JavaScript instead of TypeScript source.
Ref: PLATEXP-11106
…cation Ref: PLATEXP-11106
- Update .npmrc to use JANUS_GITHUB_PAT environment variable
- Update GitHub Actions workflow to set JANUS_GITHUB_PAT
- Ensures consistent token naming across the build process
Ref: PLATEXP-11106
Increase timeout from 10s to 60s to allow sufficient time for:
- Testing 6 endpoints with 5-second timeouts each
- DNS resolution and network latency
- Datadog metric publishing
This ensures connectivity checks complete successfully and metrics reach Datadog for near real-time alerting on critical endpoints.
Ref: PLATEXP-11106
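The arithmetic behind this change can be spelled out in a few lines. This is only a sketch: the function names are hypothetical, and the endpoint count and timeouts are the figures quoted in the commit message.

```typescript
// Worst case when checks run one after another: every check hits its timeout.
function sequentialWorstCaseMs(endpointCount: number, perCheckTimeoutMs: number): number {
  return endpointCount * perCheckTimeoutMs;
}

// True when the worst-case check time fits inside the Lambda timeout.
function fitsWithin(worstCaseMs: number, lambdaTimeoutMs: number): boolean {
  return worstCaseMs <= lambdaTimeoutMs;
}

// Figures from the commit message: 6 endpoints, 5-second timeout each.
const worstCaseMs = sequentialWorstCaseMs(6, 5_000); // 30000 ms
const fitsOldTimeout = fitsWithin(worstCaseMs, 10_000); // false: 30 s exceeds 10 s
const fitsNewTimeout = fitsWithin(worstCaseMs, 60_000); // true: leaves room for DNS and publishing
```

With sequential checks, the old 10-second limit could be exceeded as soon as two endpoints timed out, which is consistent with the move to 60 seconds.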
…execution
PLATEXP-11106
Changes:
- Modified handler to use Promise.all() for parallel execution
- All 6 endpoints now checked simultaneously instead of sequentially
- Reduces total execution time from ~30s to ~5s
- Eliminates timeout cascade where slow endpoints affect subsequent checks
- Improves reliability by preventing sequential timeout failures
Benefits:
- Faster detection of connectivity issues
- More accurate results (no timeout cascades)
- Lower Lambda execution costs
- Better user experience in dashboard
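A minimal sketch of the parallel pattern this commit describes, assuming each check is an async function with its own timeout. The names `Check`, `CheckOutcome`, `withTimeout`, `runCheck`, and `runAllChecks` are hypothetical and not taken from the handler.

```typescript
// Hypothetical types for illustration; the real handler's types are not shown in the PR.
interface CheckOutcome {
  name: string;
  success: boolean;
  latencyMs: number;
  error?: string;
}

interface Check {
  name: string;
  run: () => Promise<void>;
}

// Reject if the work has not settled within `ms`. The underlying work is not cancelled.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Run one check and turn any failure into a result instead of a rejection,
// so one bad endpoint cannot make Promise.all() reject for all of them.
async function runCheck(check: Check, timeoutMs: number): Promise<CheckOutcome> {
  const started = Date.now();
  try {
    await withTimeout(check.run(), timeoutMs);
    return { name: check.name, success: true, latencyMs: Date.now() - started };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { name: check.name, success: false, latencyMs: Date.now() - started, error: message };
  }
}

// Start every check at once; total time is roughly the slowest check, not the sum.
function runAllChecks(checks: Check[], timeoutMs = 5_000): Promise<CheckOutcome[]> {
  return Promise.all(checks.map((check) => runCheck(check, timeoutMs)));
}
```

Because each check resolves to an outcome rather than rejecting, the total run time is bounded by the per-check timeout, which matches the drop from ~30s to ~5s reported above.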
smayberry reviewed Feb 4, 2026
Switch lambda package from CommonJS to ES modules to be consistent with scripts directory and avoid mixing module systems. Updated package.json type field and tsconfig.json module target to ES2022. Addresses PR #79 review comment from smayberry about avoiding mixed import/require statements. Task: PLATEXP-11106
- Use stats.timing() instead of stats.gauge() for latency metrics
- Add metric_tags variable for flexible service tagging
- Replace JANUS-specific variables with generic ENVIRONMENT and METRIC_TAGS
- Update EventBridge rule to use state parameter instead of conditional creation
- Rename janus_environment variable to environment for broader reusability
Addresses all review comments from smayberry on Feb 4, 2026.
Task: PLATEXP-11106
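One way the METRIC_TAGS value could be combined with the per-endpoint tags when recording latency is sketched below. The comma-separated "key:value" format is an assumption, since the PR does not show how the variable is encoded, and `TimingStats`, `parseMetricTags`, and `recordLatency` are hypothetical names; only stats.timing() itself is named in the commit.

```typescript
// Assumed format: comma-separated "key:value" pairs, e.g. "service:janus,team:platform".
function parseMetricTags(raw: string | undefined): string[] {
  if (!raw) return [];
  return raw
    .split(",")
    .map((tag) => tag.trim())
    .filter((tag) => tag.length > 0);
}

// Hypothetical subset of the stats client; the real signature may differ.
interface TimingStats {
  timing(name: string, valueMs: number, tags: string[]): void;
}

// Record latency as a timing, appending the custom tags to the base tags.
function recordLatency(
  stats: TimingStats,
  latencyMs: number,
  baseTags: string[],
  rawMetricTags: string | undefined,
): void {
  const tags = baseTags.concat(parseMetricTags(rawMetricTags));
  stats.timing("connectivity.endpoint.latency", latencyMs, tags);
}
```

In a Lambda, the raw string would typically be read from the environment (for example the METRIC_TAGS variable) and passed in, which keeps the parsing logic free of environment access.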
PLATEXP-11106 The Lambda was failing with 'Cannot use import statement outside a module' because package.json with type: module was not included in the zip file. Updated build script to include package.json alongside index.js.
a05888e to fdef05d
ES modules don't support directory imports. Must explicitly specify index.js when importing from @racker/janus-core subdirectories. Fixes: Directory import '/var/task/node_modules/@racker/janus-core/lib/stats' is not supported resolving ES modules JIRA: PLATEXP-11106
fdef05d to 2ee2282
ES modules don't work well with @racker/janus-core subdirectory imports. Reverting to CommonJS which works reliably with the package. JIRA: PLATEXP-11106
8c6baa3 to 8d429d3
sai-praveen-os commented Feb 9, 2026
…TypeScript compatibility
…S module compatibility
…ith nodenext resolution
41476bf to 6583f2f
The OpenTelemetry layer was causing Lambda timeouts when combined with janus-core stats library. Both were trying to send telemetry data, causing conflicts and blocking stats.close(). Since janus-core stats already sends metrics directly to Datadog via HTTP, the OTEL layer is unnecessary for this connectivity check Lambda.
Removed:
- OpenTelemetry Lambda layer
- AWS_LAMBDA_EXEC_WRAPPER environment variable
This matches the approach used in other simple monitoring Lambdas and resolves the intermittent timeout issues.
Task: PLATEXP-11106
sai-praveen-os commented Feb 12, 2026
…nd defaults
PLATEXP-11106
- Fixed variable name from janus_environment to environment
- Added metric_tags variable to inputs table
- Updated timeout default from 10 to 60 seconds
- Added metric_tags parameter to usage example
smayberry approved these changes Feb 12, 2026
JIRA: PLATEXP-11106
Add Direct Datadog Monitoring to connectivity_check Module
Problem

The `connectivity_check` module currently supports on-demand connectivity testing via script invocation. We need automated scheduled monitoring with near real-time metrics for proactive alerting. CloudWatch-based approaches have 15-20 minute delays, which is too slow for critical services.

Solution

Extend the module with optional monitoring capabilities using direct Datadog integration via `@racker/janus-core`. This provides metrics in seconds rather than minutes, enabling fast alerting on connectivity failures.

Changes

`lambda/index.ts`:
- `@racker/janus-core` stats library
- `connectivity.` prefix
- `status` (gauge), `latency` (timing/distribution), `success.count` (counter), `error.count` (counter)
- Tags: `env`, `endpoint`, `host`, `protocol`, `critical`
- `METRIC_TAGS` environment variable

`lambda/package.json`:
- `@racker/janus-core` dependency for Datadog integration and logging

`lambda/tsconfig.json`:

`scripts/build-lambda.sh`:

`.github/workflows/build-lambda.yml`:
- `@racker/*` dependencies

`main.tf`:
- `DATADOG_API_KEY`, `ENVIRONMENT`, `METRIC_TAGS`

`variables.tf`:
- `enable_monitoring`, `monitoring_schedule`, `monitoring_targets`
- `datadog_api_key`, `environment`, `metric_tags`

`monitoring.tf`:

Usage Examples
Basic Monitoring Setup
Monitoring with Custom Timeout
Metrics Published to Datadog

All metrics use the `connectivity.` prefix:
- `connectivity.endpoint.status`
- `connectivity.endpoint.latency`
- `connectivity.endpoint.success.count`
- `connectivity.endpoint.error.count`

Note: The `latency` metric uses `stats.timing()`, which creates a distribution metric with automatic aggregations (.avg, .min, .max, .median, .95percentile, etc.). Custom tags from the `metric_tags` variable are appended to all metrics.

Key Features

- `metric_tags` variable

Implementation Notes
Lambda Module System: The Lambda uses CommonJS (`require()`) instead of ES modules because `@racker/janus-core@12.4.0` is CommonJS-only with no ES module support. ES modules were attempted but failed at runtime with directory import errors. This is an internal implementation detail that doesn't affect module consumers.

Latency Metrics: The `connectivity.endpoint.latency` metric uses `stats.timing()`, which creates a Datadog distribution metric. This automatically generates aggregated metrics like `.avg`, `.min`, `.max`, `.median`, `.95percentile`, etc. Use `connectivity.endpoint.latency.avg` in Datadog queries.

Build Process

- `lambda/` directory
- `JANUS_GITHUB_PAT`
- `lambda.zip` committed back to repository
- `lambda.zip` (no build during apply)

Testing
Tested in janus-infrastructure dev environment:
Breaking Changes
None - all monitoring features are opt-in via the `enable_monitoring` flag.