Yousef/sync main to cohere #32
Open
yousef-cohere wants to merge 104 commits into
Enhance the test suite for the agent-protocol-forwarder component by adding new test cases across 5 test functions. The improvements include comprehensive tests for Config.Setup() method covering all command-line flag combinations, TLS configuration scenarios (enabled/disabled/skip-verify), edge cases for config file loading (empty files, null values, extra fields, permission errors), host interface configuration, default value handling, and error scenarios (missing files, invalid JSON). Signed-off-by: Anjana A R K <anjana.a.r.k1@ibm.com>
When a pod uses initdata (e.g. KBS tests with cc_kbc), both kata-agent
and confidential-data-hub start after process-user-data completes.
The startup order is:
process-user-data -> AA -> AA socket -> CDH (via CDH.path) -> CDH socket
                  |
                  -> kata-agent (direct enable)
kata-agent.path already exists to gate kata-agent on CDH socket
appearance (PathExists=/run/confidential-containers/cdh.sock). However,
kata-agent.service also had WantedBy=multi-user.target in its [Install]
section, causing systemd to activate it directly at boot without waiting
for the path unit condition to be satisfied.
Fix: remove [Install]/WantedBy=multi-user.target from kata-agent.service
so that systemd can only activate it via kata-agent.path.
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
The preset file referenced 'attestation-protocol-forwarder.service' which does not exist. The correct service name is 'agent-protocol-forwarder.service'. This was a no-op in practice because agent-protocol-forwarder.service has WantedBy=multi-user.target in its [Install] section, so systemd enables it via the symlink in multi-user.target.wants/ regardless. The stale preset name has been corrected. Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
Add agent-protocol-forwarder.path (watches /run/kata-containers/agent.sock)
and scratch-storage.path (watches /run/peerpod/scratch-space.marker) so
both services start only when their prerequisite exists, eliminating the
Restart=on-failure polling loop for APF.
Remove redundant time-based After= deps from AA, CDH, and kata-agent that
are already implied by the path chain. Keep After=process-user-data.service
on AA and CDH: process-user-data writes aa.toml before cdh.toml, so without
this guard CDH can start before cdh.toml is written and lose its KBS config.
Keep After=kata-agent.service on api-server-rest so CDH finishes plugin
init before api-server-rest connects to it.
Remove [Install]/WantedBy=multi-user.target from kata-agent.service so
systemd can only activate it via kata-agent.path. Update 30-coco.preset
and multi-user.target.wants to enable the path units instead of the
services directly.
Activation chain:
process-user-data -> aa.toml -> AA -> AA.sock -> CDH -> CDH.sock
  -> kata-agent -> agent.sock -> APF -> setup-nat-for-imds
                -> api-server-rest (After=kata-agent)
scratch-space.marker -> scratch-storage (kata-agent After= orders it)
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
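The activation chain in this commit message can be read as a small dependency graph. The sketch below is only an illustrative model of that reading: `ACTIVATION`, `can_start` and `activation_order` are hypothetical names, and systemd itself reacts to path events rather than sorting a graph. scratch-storage is left out because its ordering relative to kata-agent is only summarised in the message.

```python
from graphlib import TopologicalSorter

# Each key can only appear or start once everything in its set exists.
# Node names mirror the chain in the commit message; the graph is a
# hypothetical model, not systemd's internal representation.
ACTIVATION = {
    "aa.toml": {"process-user-data"},
    "AA": {"aa.toml"},
    "AA.sock": {"AA"},
    "CDH": {"AA.sock"},
    "CDH.sock": {"CDH"},
    "kata-agent": {"CDH.sock"},
    "agent.sock": {"kata-agent"},
    "APF": {"agent.sock"},
    "setup-nat-for-imds": {"APF"},
    "api-server-rest": {"kata-agent"},
}


def can_start(unit, present):
    """True when every prerequisite of `unit` is already present."""
    return ACTIVATION.get(unit, set()) <= set(present)


def activation_order(graph):
    """One valid start order that respects every prerequisite."""
    return list(TopologicalSorter(graph).static_order())


order = activation_order(ACTIVATION)
# With only process-user-data finished, kata-agent must still wait.
early = can_start("kata-agent", {"process-user-data"})
```

In this model, kata-agent is only startable once `CDH.sock` is present, which is the behaviour the removal of `WantedBy=multi-user.target` is meant to enforce.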
Bumps [github.com/moby/spdystream](https://github.com/moby/spdystream) from 0.5.0 to 0.5.1. - [Release notes](https://github.com/moby/spdystream/releases) - [Commits](moby/spdystream@v0.5.0...v0.5.1) --- updated-dependencies: - dependency-name: github.com/moby/spdystream dependency-version: 0.5.1 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>
…tlptracehttp Bumps [go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp](https://github.com/open-telemetry/opentelemetry-go) from 1.21.0 to 1.43.0. - [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases) - [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md) - [Commits](open-telemetry/opentelemetry-go@v1.21.0...v1.43.0) --- updated-dependencies: - dependency-name: go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp dependency-version: 1.43.0 dependency-type: indirect ... --- Updated modules with `go mod tidy` Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Bumps [aws-actions/configure-aws-credentials](https://github.com/aws-actions/configure-aws-credentials) from 6.0.0 to 6.1.0. - [Release notes](https://github.com/aws-actions/configure-aws-credentials/releases) - [Changelog](https://github.com/aws-actions/configure-aws-credentials/blob/main/CHANGELOG.md) - [Commits](aws-actions/configure-aws-credentials@8df5847...ec61189) --- updated-dependencies: - dependency-name: aws-actions/configure-aws-credentials dependency-version: 6.1.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 7.0.0 to 7.1.0. - [Release notes](https://github.com/docker/build-push-action/releases) - [Commits](docker/build-push-action@d08e5c3...bcafcac) --- updated-dependencies: - dependency-name: docker/build-push-action dependency-version: 7.1.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 7.0.0 to 7.0.1. - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](actions/upload-artifact@bbbca2d...043fb46) --- updated-dependencies: - dependency-name: actions/upload-artifact dependency-version: 7.0.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>
Bumps [oras-project/setup-oras](https://github.com/oras-project/setup-oras) from 1.2.4 to 2.0.0. - [Release notes](https://github.com/oras-project/setup-oras/releases) - [Commits](oras-project/setup-oras@22ce207...38de303) --- updated-dependencies: - dependency-name: oras-project/setup-oras dependency-version: 2.0.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Lukáš Doktor <ldoktor@redhat.com> * github.com:confidential-containers/cloud-api-adaptor: build(deps): bump aws-actions/configure-aws-credentials
…updates Bumps the google-cloud group with 1 update in the /src/cloud-api-adaptor directory: [cloud.google.com/go/compute](https://github.com/googleapis/google-cloud-go). Bumps the google-cloud group with 3 updates in the /src/cloud-providers directory: [cloud.google.com/go/compute](https://github.com/googleapis/google-cloud-go), [cloud.google.com/go/resourcemanager](https://github.com/googleapis/google-cloud-go) and [cloud.google.com/go/auth](https://github.com/googleapis/google-cloud-go). Updates `cloud.google.com/go/compute` from 1.58.0 to 1.59.0 - [Release notes](https://github.com/googleapis/google-cloud-go/releases) - [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/CHANGES.md) - [Commits](googleapis/google-cloud-go@compute/v1.58.0...compute/v1.59.0) Updates `cloud.google.com/go/compute` from 1.58.0 to 1.59.0 - [Release notes](https://github.com/googleapis/google-cloud-go/releases) - [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/CHANGES.md) - [Commits](googleapis/google-cloud-go@compute/v1.58.0...compute/v1.59.0) Updates `cloud.google.com/go/resourcemanager` from 1.11.0 to 1.12.0 - [Release notes](https://github.com/googleapis/google-cloud-go/releases) - [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/documentai/CHANGES.md) - [Commits](googleapis/google-cloud-go@iap/v1.11.0...iap/v1.12.0) Updates `cloud.google.com/go/auth` from 0.18.2 to 0.20.0 - [Release notes](https://github.com/googleapis/google-cloud-go/releases) - [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/CHANGES.md) - [Commits](googleapis/google-cloud-go@auth/v0.18.2...v0.20.0) --- updated-dependencies: - dependency-name: cloud.google.com/go/compute dependency-version: 1.59.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: google-cloud - dependency-name: cloud.google.com/go/compute dependency-version: 1.59.0 dependency-type: direct:production update-type: 
version-update:semver-minor dependency-group: google-cloud - dependency-name: cloud.google.com/go/resourcemanager dependency-version: 1.12.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: google-cloud - dependency-name: cloud.google.com/go/auth dependency-version: 0.20.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: google-cloud ... Signed-off-by: dependabot[bot] <support@github.com>
The CI job for AWS has failed for a while now and we still don't know the cause. Instead of disabling it completely, let's just ignore its status because it is still worth running it (e.g. catch build/setup/infra issues). Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
iptables-wrapper-installer.sh was removed in kubernetes-sigs/iptables-wrappers#14, so call the binary directly. Assisted-by: IBM Bob Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
setup-go lacks endian awareness, so port the fix from caa_build_and_push_per_arch.yaml to the standard CAA build workflow. This lets us run the workflow on the ppc runner rather than under emulation, which can be slow. Drive-by fix of a zizmor warning. Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The difference between the caa_build_and_push and caa_build_and_push_per_arch is confusing. I hope to address this in this PR, but let's start by renaming for better clarity. Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Add a helper script that can create one or more multi-arch manifest images for our three supported architectures, given a registry and a list of one or more tags. Based on the release.sh script from kata-containers. Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Currently we have multiple formats of image tags for dev images:
- <release>-dev-<arch> for released images
- dev-<sha>-<arch> for interim published images
- latest-<arch>-dev for daily e2e test images
- ci-pr<pr number>-dev (no arch) for the x86 only packer PR e2e test images
- ci-pr<pr number>-<arch>-dev for mkosi specific-arch PR e2e test images
This shows that we have multiple code paths running different logic to do the same task. To reduce duplication and increase consistency, move them all to the release format: <tag>-dev-<arch>
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
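As a minimal sketch of the unified `<tag>-dev-<arch>` format, the helper below builds one tag per architecture. The function names, the example tag, and the architecture names are assumptions made for illustration; the commit message does not list them.

```python
# Hypothetical architecture names; the commit only says tags end in <arch>.
EXAMPLE_ARCHES = ("amd64", "s390x", "ppc64le")


def dev_image_tag(tag: str, arch: str) -> str:
    """Build a dev image tag in the unified <tag>-dev-<arch> format."""
    if not tag or not arch:
        raise ValueError("both tag and arch are required")
    return f"{tag}-dev-{arch}"


def dev_image_tags(tag: str, arches=EXAMPLE_ARCHES) -> list:
    """One tag per architecture, e.g. as inputs for a multi-arch manifest."""
    return [dev_image_tag(tag, arch) for arch in arches]


tags = dev_image_tags("ci-pr123")  # "ci-pr123" is a made-up example tag
```

Keeping the format in one place means every caller emits the same tag shape, which is the consistency the commit is after.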
Rather than having separate logic and builds for the multi-arch image including the confusing upload/download of a tags file to drive things, we can just swap and use the existing CAA build workflow, to build the images for each arch and the new multi-arch publish to create the multi-arch manifest. Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Nothing should be calling the `image-with-arch` make target anymore now that the process is unified, so remove it and the code that only it called to simplify and remove duplication. Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Similar to kata-containers/kata-containers@a04df4f disable the provenance and sbom for single arch images, so that we can use them in a multi-arch image later Signed-off-by: stevenhorsman <steven@uk.ibm.com>
For legacy? reasons AWS is using the non-arch specific CAA image build, but given that it's now the same as the x86 e2e image, switch to that to reduce duplication Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The non-debug images have been published as debug by mistake. inputs.debug is a boolean type and it should be handled as boolean. Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
Re-usable workflows inherit the workflow name from the caller, so extend the concurrency group to make it unique to the instance. Signed-off-by: stevenhorsman <steven@uk.ibm.com>
We want to switch from the Fedora-based mkosi podvm image to the Ubuntu-based one for stability and GPU support, so add e2e tests to see how it's working. Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Bump components to match the kata 3.29.0 release Assisted-by: IBM Bob Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Update to pick up the 3.29.0 release Signed-off-by: stevenhorsman <steven@uk.ibm.com>
`GetDiagnosticData` has gone into the agent protos, so we need to add it in our implementations of this too. Assisted-by: IBM Bob Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Enhance test coverage for interceptor_test.go and forwarder_test.go with comprehensive subtests. They cover mount errors, namespace handling, DNS configuration, TLS setup, daemon lifecycle, and edge cases. Signed-off-by: Anjana A R K <anjana.a.r.k1@ibm.com>
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 4.35.1 to 4.35.2. - [Release notes](https://github.com/github/codeql-action/releases) - [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md) - [Commits](github/codeql-action@c10b806...95e58e9) --- updated-dependencies: - dependency-name: github/codeql-action dependency-version: 4.35.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>
Split the monolithic peerpod-ctrl_image.yaml into two workflows following the same pattern as CAA:
- peerpod-ctrl_build_and_push.yaml: per-arch callable workflow
- peerpod-ctrl_build_and_push_all_arches.yaml: orchestrator + manifest
Assisted-by: Claude
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
After the kustomize-to-helm migration, nightly e2e tests deploy peerpod-ctrl and webhook as sub-charts but use stale `latest` images from quay.io. Since peerpod-ctrl shares the cloud-providers Go module with caa, version skew can mask breaking changes. Build both images alongside caa/podvm in the nightly pipeline and wire their references through to the helm install via PEERPOD_CTRL_IMAGE and WEBHOOK_IMAGE environment variables. The subchart image override logic is centralized in Helm.ConfigureSubchartImages() to avoid duplicating code across all provider implementations. Assisted-by: Claude Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
The packer-based podvm images for Azure haven't been actively maintained for years and are most likely insecure and not functional. To avoid confusion, we remove the packer build-infra from the repo. Signed-off-by: Magnus Kulke <magnuskulke@microsoft.com>
The terraform based ci-infra folder was only used by azure, it makes sense to move it to the azure subfolder for discoverability. Signed-off-by: Magnus Kulke <magnuskulke@microsoft.com>
AWSUserDataProvider currently issues a bare GET to /latest/user-data (IMDSv1). On EC2 instances configured with MetadataOptions.HttpTokens=required (IMDSv2-only), this returns 401 and peer-pod boot fails before kata-agent starts. Many enterprise AWS organizations enforce IMDSv2 via an SCP, so the bare IMDSv1 path is unusable in those environments, and AWS now defaults new EC2 launches to v2-only as well. This change adds an IMDSv2 token PUT before the user-data GET and attaches the returned session token via the X-aws-ec2-metadata-token header. If the token PUT fails for any reason (network policy blocks PUT, legacy IMDSv1-only configuration, transient error), the helper returns nil headers so the existing IMDSv1 GET path is preserved as a fallback. No existing flow regresses. Validated on an AWS organization with SCP-enforced HttpTokens=required: peer-pod boots end to end, /dev/sev-guest reachable inside the SEV-SNP guest, attestation report retrieved. Unit tests cover the success path, non-200 token response, the returned headers shape, and an end-to-end fallback where the token endpoint 401s and the user-data GET succeeds without the token header. Ref: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-options.html Signed-off-by: Abhishek Agrawal <abhishek.yours4@gmail.com>
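The token-then-GET flow with IMDSv1 fallback described in this commit can be sketched as follows. This is not the repository's Go implementation: the function names, the injectable `opener` parameter, and the TTL value are assumptions for illustration. The endpoint paths and header names are the documented IMDSv2 ones.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254"
TOKEN_TTL_SECONDS = "21600"  # assumed TTL; IMDSv2 accepts 1 to 21600 seconds


def imds_token_headers(base_url=IMDS_BASE, opener=urllib.request.urlopen, timeout=2):
    """Return headers carrying an IMDSv2 session token, or None on any failure."""
    req = urllib.request.Request(
        base_url + "/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": TOKEN_TTL_SECONDS},
    )
    try:
        with opener(req, timeout=timeout) as resp:
            if resp.status != 200:
                return None
            token = resp.read().decode().strip()
    except OSError:  # URLError and HTTPError are both OSError subclasses
        return None
    return {"X-aws-ec2-metadata-token": token} if token else None


def fetch_user_data(base_url=IMDS_BASE, opener=urllib.request.urlopen, timeout=2):
    """GET user-data, attaching the token when available (IMDSv1 otherwise)."""
    headers = imds_token_headers(base_url, opener, timeout) or {}
    req = urllib.request.Request(base_url + "/latest/user-data", headers=headers)
    with opener(req, timeout=timeout) as resp:
        return resp.read()
```

Returning `None` rather than raising from the token helper is what keeps the legacy IMDSv1 path alive: the caller simply sends the GET without the token header.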
Ensure mkosi debug builds use the active image-tree repart layout and include NVMe modules so GCP boot disks are visible during initrd root setup. Co-authored-by: Cursor <cursoragent@cursor.com>
On AWS SEV-SNP enabled EC2 instances (the launch shape used for peer-pod PodVMs when CpuOptions.AmdSevSnp=enabled is set), the mkosi-built Fedora-based PodVM image consistently fails to boot. In my environment the kernel panics during early init -- the EC2 serial console shows the vmgenid driver loading and subsequent platform / device probing wedging before systemd ever starts. Issue confidential-containers#2691 reports a related manifestation on a different setup where the guest exits with Client.InstanceInitiatedShutdown ~12 seconds into kernel init. The same image boots normally on non-SEV-SNP instances; switching only the AmdSevSnp CpuOption flips the build between booting and not booting. Adding initcall_blacklist=vmgenid_plaform_driver_init to the kernel command line skips the vmgenid platform driver registration, the early init path no longer wedges, and the PodVM boots end to end. (The 'plaform' typo is intentional -- it matches the kernel symbol name in drivers/virt/vmgenid.c, which declares 'static struct platform_driver vmgenid_plaform_driver'.) This is the same one-line workaround posted by @bpradipt in issue confidential-containers#2691 on 2025-12-23, originally bundled into the Fedora 43 upgrade in confidential-containers#2729 but dropped when that PR was superseded by the slimmer confidential-containers#2914. The fix was never re-extracted into its own PR. Per @mkulke's suggestion on confidential-containers#2729 ("a PR would be better (if the fix is urgent), since there is probably more work required for s390x"), this commit carries only the kernel-cli line in isolation. vmgenid is x86-only, so initcall_blacklist of vmgenid_plaform_driver_init is a no-op on s390x kernels and should not require any s390x-specific handling. Placed in the base mkosi.conf rather than a per-arch conf to match the location of the patch originally posted on confidential-containers#2691. 
Validated end to end on AWS SEV-SNP peer-pods (c6a.2xlarge in us-east-2): PodVM boots, SEV-SNP is enabled (CpuOptions.AmdSevSnp=enabled), kata-agent comes up and serves the agent endpoint over the vxlan tunnel from the worker, and an AMD-signed SEV-SNP attestation report is retrieved from inside the guest. The same kernel-cli workaround is currently being hit cross-distro -- siderolabs/talos#13118 reports the equivalent boot hang on Talos 1.12 on AWS, suggesting the underlying kernel fix (https://www.spinics.net/lists/kernel/msg5976520.html) has not yet propagated to widely-used distro kernels. Refs: - confidential-containers#2691 (boot-hang issue, closed with workaround in comment) - confidential-containers#2729 (original F43 upgrade PR that bundled this fix, closed) - confidential-containers#2914 (slim F43 bump that superseded confidential-containers#2729, did not include this) - siderolabs/talos#13118 (same kernel issue on Talos / AWS SNP) Signed-off-by: Abhishek Agrawal <abhishek.yours4@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…ohere Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # src/cloud-api-adaptor/install/charts/peerpods/values.yaml
Upstream merge brought in new transitive dependencies (otelgrpc, otelhttp, google.golang.org/api, genproto, grpc) and bumped several cloud.google.com/go modules. Tidy go.mod/go.sum to match. Co-authored-by: Cursor <cursoragent@cursor.com>
alhassankhedr-cohere
previously approved these changes
May 22, 2026
Co-authored-by: Cursor <cursoragent@cursor.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 59340ba.
Co-authored-by: Cursor <cursoragent@cursor.com>
alhassankhedr-cohere
previously approved these changes
May 22, 2026
The previous ExecStartPost pipeline in process-user-data.service relied on /bin/sh + sed + printf to convert the hex digest in /run/peerpod/initdata.digest into 48 raw bytes before piping into the rtmr3 sysfs node. On Ubuntu /bin/sh is dash, whose printf does not honour \xHH escapes; combined with systemd unit-file parsing collapsing \\\\x to \\x and GNU sed dropping the backslash before the 'x' in its replacement, the bytes that actually reached the kernel were the ASCII string "xHHxHHxHH..." for the first 16 hex pairs of the digest. RTMR3 was therefore extended with garbage that bound to only ~128 bits of the digest and was sensitive to upstream dash/sed/systemd parsing changes, so SHA384(0 || digest) predictions never matched. Move the hex-to-binary conversion into a small dedicated bash helper at /usr/local/bin/extend-rtmr3-initdata, which uses bash parameter expansion and printf to write the raw 48 bytes. Verified on a live podvm peer pod that the resulting RTMR3 matches SHA384(prev || digest) bit-for-bit. Co-authored-by: Cursor <cursoragent@cursor.com>
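To make the failure mode in this commit concrete, the sketch below contrasts the intended 48 raw bytes with the literal "xHH" text described in the message, and computes the extend as SHA384(prev || digest). It is a Python illustration, not the bash helper itself. The function names and the sample digest are hypothetical, and it assumes the old pipeline's output was consumed as exactly 48 bytes.

```python
import hashlib

DIGEST_LEN = 48  # SHA-384 output size in bytes


def hex_digest_to_bytes(hex_digest: str) -> bytes:
    """Convert a 96-character hex digest into its 48 raw bytes."""
    raw = bytes.fromhex(hex_digest.strip())
    if len(raw) != DIGEST_LEN:
        raise ValueError(f"expected {DIGEST_LEN} bytes, got {len(raw)}")
    return raw


def extend(prev: bytes, data: bytes) -> bytes:
    """Predicted register value after an extend: SHA384(prev || data)."""
    return hashlib.sha384(prev + data).digest()


def broken_payload(hex_digest: str) -> bytes:
    """Literal ASCII 'xHH' text for the first 16 hex pairs (16 * 3 = 48 bytes)."""
    pairs = [hex_digest[i:i + 2] for i in range(0, 32, 2)]
    return "".join("x" + pair for pair in pairs).encode("ascii")


# Hypothetical sample digest, standing in for /run/peerpod/initdata.digest.
sample_hex = hashlib.sha384(b"example initdata").hexdigest()
initial = bytes(DIGEST_LEN)  # an all-zero starting register

expected = extend(initial, hex_digest_to_bytes(sample_hex))
observed_with_bug = extend(initial, broken_payload(sample_hex))
```

Both payloads are 48 bytes long, which is why the old write was accepted, but the broken one only encodes the first 16 bytes of the digest as text, so `expected` and `observed_with_bug` differ.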

This PR introduces BYOM (Bring Your Own Machine) e2e tests and enhances the e2e test workflows by adding support for peerpod-ctrl and webhook images.
Note
Medium Risk
Touches CI pipelines and e2e provisioning logic across AWS/libvirt/docker, plus changes metadata fetching (AWS IMDSv2) and TDX RTMR3 measurement; failures could break test coverage or boot/attestation paths but are mostly additive and guarded with fallbacks.
Overview
Adds BYOM e2e coverage by introducing a new callable `e2e_byom.yaml`, wiring it into `e2e_run_all.yaml`/`e2e_on_pull.yaml`, and extending existing AWS/libvirt/docker workflows to accept and pass through `peerpod_ctrl_image` and `webhook_image` overrides.
Refactors `peerpod-ctrl` image publishing to build per-architecture images + a multi-arch manifest (`peerpod-ctrl_build_and_push.yaml` + new `peerpod-ctrl_build_and_push_all_arches.yaml`) and updates push/release pipelines to use it; `podvm_mkosi_ubuntu.yaml` now also builds/outputs a BYOM e2e podvm container image.
Improves runtime/provisioning behavior: AWS userdata retrieval now prefers IMDSv2 token auth with IMDSv1 fallback (new tests), AWS e2e VPC provisioning chooses subnets in AZs that support the configured `podvm_instance_type` and registers AMIs with UEFI+TPM support, libvirt adds optional vCPU pinning via `LIBVIRT_CPUSET`, and PodVM systemd overrides move RTMR3 extension into a dedicated `extend-rtmr3-initdata` script.
Housekeeping: removes the Azure nightly build workflow and legacy Azure image build docs, bumps CodeQL action version, updates IBM SDK deps, updates Helm chart dependency version pinning, and updates `CITATION.cff` to `0.20.0`.
Reviewed by Cursor Bugbot for commit 670cecd.