api/conversion: implement structural DGDR conversion#9262
api/conversion: implement structural DGDR conversion#9262sttts wants to merge 5 commits intoai-dynamo:mainfrom
Conversation
Signed-off-by: Dr. Stefan Schimanski <sschimanski@nvidia.com>
Signed-off-by: Dr. Stefan Schimanski <sschimanski@nvidia.com>
Signed-off-by: Dr. Stefan Schimanski <sschimanski@nvidia.com>
|
👋 Hi sttts! Thank you for contributing to ai-dynamo/dynamo. Just a reminder: The 🚀 |
c2654ed to
8642f67
Compare
8642f67 to
e264634
Compare
WalkthroughThis PR refactors API v1alpha1↔v1beta1 conversion pipelines across DynamoComponentDeployment, DynamoGraphDeployment, and DynamoGraphDeploymentRequest by exporting previously internal converter functions, introducing typed conversion contexts to control preservation behavior, formalizing sparse annotation-based preservation/restoration logic with blob projections, and ensuring backward compatibility with legacy annotations. ChangesUnified Conversion Refactoring
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
e264634 to
912d1ef
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
deploy/operator/api/roundtrip_fuzz_test.go (1)
103-128:⚠️ Potential issue | 🟠 Major | ⚡ Quick winOnly ignore the legacy DGDR compatibility keys here.
This transformer drops
annDGDRSpecandannDGDRStatustoo, so the fuzz suite will no longer fail if a no-op round trip starts emitting structural preservation blobs. That weakens the sparse-save invariant this PR is trying to enforce.Suggested tightening
var ignoreDGDRCompatibilityAnnotations = cmpopts.AcyclicTransformer( "ignoreDGDRCompatibilityAnnotations", func(m metav1.ObjectMeta) metav1.ObjectMeta { if len(m.Annotations) == 0 { return m } annotations := make(map[string]string, len(m.Annotations)) for k, v := range m.Annotations { annotations[k] = v } - for k := range annotations { - if strings.HasPrefix(k, "nvidia.com/dgdr-") { - delete(annotations, k) - } - } + for _, k := range []string{ + "nvidia.com/dgdr-config-map-ref", + "nvidia.com/dgdr-output-pvc", + "nvidia.com/dgdr-enable-gpu-discovery", + "nvidia.com/dgdr-deployment-overrides", + "nvidia.com/dgdr-profiling-config", + "nvidia.com/dgdr-status-backend", + "nvidia.com/dgdr-profiling-results", + "nvidia.com/dgdr-deployment-status", + "nvidia.com/dgdr-profiling-job-name", + } { + delete(annotations, k) + } if len(annotations) == 0 { m.Annotations = nil } else { m.Annotations = annotations }Also applies to: 494-501
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/operator/api/roundtrip_fuzz_test.go` around lines 103 - 128, The transformer ignoreDGDRCompatibilityAnnotations is currently deleting more keys than intended (it removes annDGDRSpec and annDGDRStatus), so update the deletion logic inside the function that iterates annotations to only remove legacy keys that strictly match the DGDR compatibility prefix (e.g., strings.HasPrefix(k, "nvidia.com/dgdr-")) while explicitly preserving keys named annDGDRSpec and annDGDRStatus; adjust the conditional that calls delete(annotations, k) to skip those two identifier keys so the structural preservation blobs remain intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deploy/operator/api/v1alpha1/dynamographdeploymentrequest_conversion.go`:
- Around line 695-727: saveDGDRHubOnlySpec currently allocates save.Workload and
save.SLA whenever src.Workload or src.SLA exist, even if their inner pointers
(Concurrency, RequestRate, E2ELatency) are all nil, causing
dgdrHubSpecSaveIsZero to misreport non-empty and write an empty hub payload;
change saveDGDRHubOnlySpec so it only creates and assigns save.Workload when at
least one of src.Workload.Concurrency or src.Workload.RequestRate is non-nil
(copy those fields as now), and only creates and assigns save.SLA when
src.SLA.E2ELatency is non-nil, leaving save.Workload/save.SLA nil otherwise;
keep other behavior (Hardware, Overrides, Features, SearchStrategy) unchanged.
---
Outside diff comments:
In `@deploy/operator/api/roundtrip_fuzz_test.go`:
- Around line 103-128: The transformer ignoreDGDRCompatibilityAnnotations is
currently deleting more keys than intended (it removes annDGDRSpec and
annDGDRStatus), so update the deletion logic inside the function that iterates
annotations to only remove legacy keys that strictly match the DGDR
compatibility prefix (e.g., strings.HasPrefix(k, "nvidia.com/dgdr-")) while
explicitly preserving keys named annDGDRSpec and annDGDRStatus; adjust the
conditional that calls delete(annotations, k) to skip those two identifier keys
so the structural preservation blobs remain intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 336be1fd-cca2-4c05-8387-2d9320185d2b
📒 Files selected for processing (10)
deploy/operator/api/CONVERSION.mddeploy/operator/api/roundtrip_fuzz_test.godeploy/operator/api/v1alpha1/dynamocomponentdeployment_conversion.godeploy/operator/api/v1alpha1/dynamographdeployment_conversion.godeploy/operator/api/v1alpha1/dynamographdeployment_conversion_test.godeploy/operator/api/v1alpha1/dynamographdeploymentrequest_conversion.godeploy/operator/api/v1alpha1/dynamographdeploymentrequest_conversion_test.godeploy/operator/api/v1alpha1/dynamographdeploymentrequest_legacy_conversion_test.godeploy/operator/api/v1alpha1/shared_spec_conversion.godeploy/operator/api/v1alpha1/shared_spec_conversion_bugs_test.go
| func saveDGDRHubOnlySpec(src *v1beta1.DynamoGraphDeploymentRequestSpec, save *v1beta1.DynamoGraphDeploymentRequestSpec) { | ||
| if src == nil || save == nil { | ||
| return | ||
| } | ||
| if src.Hardware != nil { | ||
| save.Hardware = src.Hardware.DeepCopy() | ||
| } | ||
| if src.Workload != nil { | ||
| save.Workload = &v1beta1.WorkloadSpec{} | ||
| if src.Workload.Concurrency != nil { | ||
| v := *src.Workload.Concurrency | ||
| save.Workload.Concurrency = &v | ||
| } | ||
| if src.Workload.RequestRate != nil { | ||
| v := *src.Workload.RequestRate | ||
| save.Workload.RequestRate = &v | ||
| } | ||
| } | ||
| if src.SLA != nil { | ||
| save.SLA = &v1beta1.SLASpec{} | ||
| if src.SLA.E2ELatency != nil { | ||
| v := *src.SLA.E2ELatency | ||
| save.SLA.E2ELatency = &v | ||
| } | ||
| } | ||
| if src.Overrides != nil { | ||
| saveDGDRHubOnlyOverrides(src.Overrides, save) | ||
| } | ||
| if src.Features != nil && src.Features.Mocker != nil && !src.Features.Mocker.Enabled { | ||
| save.Features = &v1beta1.FeaturesSpec{Mocker: &v1beta1.MockerSpec{Enabled: false}} | ||
| } | ||
| save.SearchStrategy = src.SearchStrategy | ||
| } |
There was a problem hiding this comment.
Keep the hub-only save sparse for Workload and SLA.
saveDGDRHubOnlySpec allocates save.Workload/save.SLA whenever the live hub object has those structs, even if Concurrency, RequestRate, and E2ELatency are all nil. That makes dgdrHubSpecSaveIsZero fail and writes annDGDRSpec with no hub-only payload.
Suggested tightening
func saveDGDRHubOnlySpec(src *v1beta1.DynamoGraphDeploymentRequestSpec, save *v1beta1.DynamoGraphDeploymentRequestSpec) {
if src == nil || save == nil {
return
}
if src.Hardware != nil {
save.Hardware = src.Hardware.DeepCopy()
}
- if src.Workload != nil {
- save.Workload = &v1beta1.WorkloadSpec{}
- if src.Workload.Concurrency != nil {
- v := *src.Workload.Concurrency
- save.Workload.Concurrency = &v
- }
- if src.Workload.RequestRate != nil {
- v := *src.Workload.RequestRate
- save.Workload.RequestRate = &v
- }
+ if src.Workload != nil && (src.Workload.Concurrency != nil || src.Workload.RequestRate != nil) {
+ save.Workload = &v1beta1.WorkloadSpec{}
+ if src.Workload.Concurrency != nil {
+ v := *src.Workload.Concurrency
+ save.Workload.Concurrency = &v
+ }
+ if src.Workload.RequestRate != nil {
+ v := *src.Workload.RequestRate
+ save.Workload.RequestRate = &v
+ }
}
- if src.SLA != nil {
+ if src.SLA != nil && src.SLA.E2ELatency != nil {
save.SLA = &v1beta1.SLASpec{}
- if src.SLA.E2ELatency != nil {
- v := *src.SLA.E2ELatency
- save.SLA.E2ELatency = &v
- }
+ v := *src.SLA.E2ELatency
+ save.SLA.E2ELatency = &v
}
if src.Overrides != nil {
saveDGDRHubOnlyOverrides(src.Overrides, save)
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deploy/operator/api/v1alpha1/dynamographdeploymentrequest_conversion.go`
around lines 695 - 727, saveDGDRHubOnlySpec currently allocates save.Workload
and save.SLA whenever src.Workload or src.SLA exist, even if their inner
pointers (Concurrency, RequestRate, E2ELatency) are all nil, causing
dgdrHubSpecSaveIsZero to misreport non-empty and write an empty hub payload;
change saveDGDRHubOnlySpec so it only creates and assigns save.Workload when at
least one of src.Workload.Concurrency or src.Workload.RequestRate is non-nil
(copy those fields as now), and only creates and assigns save.SLA when
src.SLA.E2ELatency is non-nil, leaving save.Workload/save.SLA nil otherwise;
keep other behavior (Hardware, Overrides, Features, SearchStrategy) unchanged.
Use sparse value preservation annotations for DGDR fields that cannot be represented directly by the target API version. Signed-off-by: Dr. Stefan Schimanski <sschimanski@nvidia.com>
Keep the pre-structural DGDR converter in tests so the legacy annotation format stays readable while downgrade compatibility is required. Add fuzz coverage that compares legacy and structural DGDR conversion on legacy-compatible visible shapes after stripping only the new sparse preservation annotations. Signed-off-by: Dr. Stefan Schimanski <sschimanski@nvidia.com>
912d1ef to
5b7b005
Compare
| if src.Hardware != nil { | ||
| save.Hardware = src.Hardware.DeepCopy() | ||
| } | ||
| if src.Workload != nil { |
There was a problem hiding this comment.
This allocates Workload/SLA whenever the source has those structs, even when only alpha representable fields are present.
we should only save and restore Workload when Concurrency or RequestRate is present, and SLA when E2ELatency is present
|
|
||
| func dgdrAlphaStatusMatchesHubPhase(state DGDRState, deployment *DeploymentStatus, phase v1beta1.DGDRPhase) bool { | ||
| alphaPhase := dgdrStateToPhase(string(state), deployment) | ||
| return alphaPhase == phase || phase == v1beta1.DGDRPhaseDeployed && alphaPhase == v1beta1.DGDRPhaseReady |
There was a problem hiding this comment.
I think this is too broad.
it should be
if alphaPhase == phase {
return true
}
return phase == v1beta1.DGDRPhaseDeployed && state == DGDRStateReady
That still preserves the intended lossy round trip:
hub Deployed -> alpha Ready -> hub Deployed
But it stops this bad stale restore:
hub Deployed -> alpha DeploymentDeleted -> hub Deployed
Summary
This PR aligns DGDR conversion with the structural DGD/DCD conversion rules from
api/CONVERSION.mdafter #9257. DGDR is already used in production since Dynamo 1.0, so the conversion keeps backward-compatible read behavior and continues writing legacy downgrade annotations while 1.2 -> 1.1 downgrade must remain supported.nvidia.com/dgdr-specandnvidia.com/dgdr-statusannotations.Compatibility and Fuzz Scope
Fixed Bugs
spec.profilingConfig.nodeSelectorthroughv1beta1.spec.overrides.profilingJob.template.spec.nodeSelector.state=Initializingandstate=DeploymentDeletedthrough v1beta1 using sparse annotations.spec.hardwarethrough v1alpha1 using sparse annotations.spec.workload.concurrencyandspec.workload.requestRatethrough v1alpha1 using sparse annotations.spec.sla.e2eLatencythrough v1alpha1 using sparse annotations.spec.overrides.dgdand hub-onlyspec.overrides.profilingJobleaves through v1alpha1 using sparse annotations.spec.searchStrategythrough v1alpha1 using sparse annotations.spec.features.mocker.enabled=falsethrough v1alpha1 without overriding liveuseMocker=true.phase=Deployedthrough v1alpha1 status conversion.status.profilingPhasethrough v1alpha1 using sparse annotations.status.profilingResults.paretothrough v1alpha1 using sparse annotations.status.deploymentInfothrough v1alpha1 using sparse annotations.Prompt History Trace
api/CONVERSION.md.legacyAnn*and mark them withTODO(sttts)for removal after 1.2.CONVERSION.mdnaming for structural converters, includingConvertFromDynamoGraphDeploymentRequest*andConvertToDynamoGraphDeploymentRequest*.Testing
GOCACHE=/tmp/dynamo-go-cache go test ./api -run "TestFuzzRoundTrip_DGDR|TestFuzzRoundTripMutability/DGDR" -roundtrip-fuzz-iters=3000 -count=1 -vGOCACHE=/tmp/dynamo-go-cache go test ./api/v1alpha1 -run TestDGDRFuzzLegacyAndStructural -dgdr-legacy-fuzz-iters=3000 -count=1 -vGOCACHE=/tmp/dynamo-go-cache go test ./api/... -count=1git diff --check