Fix GPU Integ Tests and add PV, PVC Integ Tests#637

Merged: sky333999 merged 1 commit into main from sky333999/gpu-fixes on Jan 8, 2026

@sky333999 sky333999 commented Dec 16, 2025

Description of changes

1) Make GPU addon test consistent with GPU daemon test

  • The GPU addon test previously followed a different pattern: some setup was done via terraform, while patching the agent image was done in the GH workflow. This PR standardizes the setup by moving all of it to terraform.

2) Use actual GPU nodes for GPU daemon test and spin up real gpu-burn

  • This allows generating all GPU metrics that rely on a real gpu pod existing (& mapped to a service). The pod is spun up via a statefulset so we can deterministically set the name to match the mock dcgm-exporter data.

3) Update PV & PVC tests

  • Add validations for remaining PV & PVC metrics
  • Create PV & PVC as part of tf setup so we get real metrics

4) Remove unnecessary UseE2EMetrics flag and clean up related code
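
Item 2 relies on Kubernetes naming StatefulSet pods as the StatefulSet name followed by a zero-based ordinal (by default), which is what makes the pod name predictable enough to match mock data. A minimal sketch of that naming rule follows; the helper `statefulset_pod_name` and the StatefulSet name `gpu-burn` are illustrative assumptions, not names taken from this PR's terraform.

```shell
# Hypothetical helper: print the pod name Kubernetes assigns to the replica
# with ordinal $2 of the StatefulSet named $1 (default naming scheme).
statefulset_pod_name() {
  printf '%s-%s\n' "$1" "$2"
}

# "gpu-burn" is an assumed StatefulSet name, used only for illustration.
statefulset_pod_name gpu-burn 0
```

With a single replica, the only pod would be the one with ordinal 0, so a fixture can refer to it by a fixed name. A Deployment, by contrast, appends generated suffixes to its pod names.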

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

https://github.com/aws/amazon-cloudwatch-agent/actions/runs/20257453246 shows passing integ tests for GPU/Daemon and GPU/Addon.

Comment thread terraform/eks/daemon/gpu/main.tf
@sky333999 sky333999 marked this pull request as ready for review December 16, 2025 22:57
@sky333999 sky333999 requested a review from a team as a code owner December 16, 2025 22:57
movence previously approved these changes Dec 17, 2025
the-mann previously approved these changes Dec 17, 2025

@the-mann the-mann left a comment


overall really good change, thanks for getting this done. I have a couple suggestions on how we can improve the debugging experience if the test fails again


provisioner "local-exec" {
command = <<-EOT
cd ../../../..

a bit nitpicky, but can we use something like

cd "$(git rev-parse --show-toplevel)"

instead?

Reply from the PR author (@sky333999):

I'm wondering whether we should avoid adding a dependency on git for this.
On our GH runners specifically it should work, but if someone has this code available without git, it may end up causing some pain.

Reply from a contributor:

i think that's a pretty extreme edge case, no?
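
One way to reconcile the two positions in this thread is to prefer git when it is usable and otherwise fall back to the relative path. The following is only a sketch under that assumption: the helper `repo_root` is hypothetical and is not part of this PR, and the default fallback `../../../..` simply mirrors the path shown in the snippet under review.

```shell
# Hypothetical helper: print the repository root.
# Prefers `git rev-parse --show-toplevel`; if git is missing or the current
# directory is not inside a work tree, resolves the fallback path instead.
repo_root() {
  fallback="${1:-../../../..}"
  if command -v git >/dev/null 2>&1 &&
     root="$(git rev-parse --show-toplevel 2>/dev/null)"; then
    printf '%s\n' "$root"
  else
    # Subshell so the caller's working directory is left unchanged.
    (cd "$fallback" && pwd -P)
  fi
}
```

Inside a provisioner this could be used as `cd "$(repo_root)"`. The trade-off is a few extra lines in exchange for not hard-failing when the code was obtained without git metadata.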

Comment thread terraform/eks/daemon/gpu/main.tf
Comment thread terraform/eks/addon/gpu/main.tf
@sky333999 sky333999 dismissed stale reviews from the-mann and movence via 3a5c5b1 January 8, 2026 19:24
@sky333999 sky333999 force-pushed the sky333999/gpu-fixes branch from 4efe31c to 3a5c5b1 Compare January 8, 2026 19:24
@sky333999 sky333999 merged commit ec292b2 into main Jan 8, 2026
6 checks passed
@sky333999 sky333999 deleted the sky333999/gpu-fixes branch January 8, 2026 19:40
3 participants