feat: add dynamic timeout retry with file-size-based parameters #468
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the robustness of file transfer operations by introducing a dynamic and intelligent retry mechanism. It addresses previous limitations where fixed retry parameters and cascading cancellations led to frequent failures, especially with large files over unreliable networks. The new system adapts retry behavior based on file size and isolates failures, so that individual transient errors do not disrupt the entire transfer, improving overall reliability and user experience.
Code Review
This pull request introduces a robust dynamic retry mechanism with file-size-based parameters, which is a significant improvement for handling transient network errors during large file transfers. The new pkg/retrypolicy is well-designed and thoroughly tested. A key architectural improvement is replacing cascading errgroup.WithContext with independent per-goroutine retries, preventing a single failure from terminating all concurrent operations. The addition of CLI flags for retry control and progress bar updates on retry significantly enhance usability. The changes are consistently applied across all data transfer paths. I have a couple of suggestions for improvement in pkg/backend/push.go to ensure a consistent user experience during retries.
```go
if err := retrypolicy.Do(ctx, func(rctx context.Context) error {
    return pushIfNotExist(rctx, pb, internalpb.NormalizePrompt("Copying config"), src, dst, manifest.Config, repo, tag)
}, retrypolicy.DoOpts{
    FileSize: manifest.Config.Size,
    FileName: "config",
    Config:   &cfg.RetryConfig,
}); err != nil {
    return fmt.Errorf("failed to push config to remote: %w", err)
}
```
The retry logic for pushing the config object is missing an OnRetry handler. This is inconsistent with the retry logic for layers in this file and for all components in pkg/backend/pull.go. Adding an OnRetry handler will provide better visual feedback to the user during backoff periods, improving the user experience.
```go
if err := retrypolicy.Do(ctx, func(rctx context.Context) error {
    return pushIfNotExist(rctx, pb, internalpb.NormalizePrompt("Copying config"), src, dst, manifest.Config, repo, tag)
}, retrypolicy.DoOpts{
    FileSize: manifest.Config.Size,
    FileName: "config",
    Config:   &cfg.RetryConfig,
    OnRetry: func(attempt uint, reason string, backoff time.Duration) {
        prompt := fmt.Sprintf("%s (retry %d, %s, waiting %s)",
            internalpb.NormalizePrompt("Copying config"), attempt, reason, backoff.Truncate(time.Second))
        pb.Add(prompt, manifest.Config.Digest.String(), manifest.Config.Size, nil)
    },
}); err != nil {
    return fmt.Errorf("failed to push config to remote: %w", err)
}
```
OnRetry was added in a later commit on this branch; updated to use pb.Placeholder in 973f508 for consistency with pull/fetch.
```diff
 // copy the manifest.
-if err := retry.Do(func() error {
-    return pushIfNotExist(ctx, pb, internalpb.NormalizePrompt("Copying manifest"), src, dst, ocispec.Descriptor{
+if err := retrypolicy.Do(ctx, func(rctx context.Context) error {
+    return pushIfNotExist(rctx, pb, internalpb.NormalizePrompt("Copying manifest"), src, dst, ocispec.Descriptor{
         MediaType: manifest.MediaType,
         Size:      int64(len(manifestRaw)),
         Digest:    godigest.FromBytes(manifestRaw),
         Data:      manifestRaw,
     }, repo, tag)
-}, append(defaultRetryOpts, retry.Context(ctx))...); err != nil {
+}, retrypolicy.DoOpts{
+    FileSize: int64(len(manifestRaw)),
+    FileName: "manifest",
+    Config:   &cfg.RetryConfig,
+}); err != nil {
     return fmt.Errorf("failed to push manifest to remote: %w", err)
 }
```
Similar to the config push, the retry logic for pushing the manifest is missing an OnRetry handler. For consistency and better user experience, a handler should be added to update the progress bar during retries. I've also refactored the descriptor creation to make it accessible in the OnRetry callback.
```go
// copy the manifest.
manifestDesc := ocispec.Descriptor{
    MediaType: manifest.MediaType,
    Size:      int64(len(manifestRaw)),
    Digest:    godigest.FromBytes(manifestRaw),
    Data:      manifestRaw,
}
if err := retrypolicy.Do(ctx, func(rctx context.Context) error {
    return pushIfNotExist(rctx, pb, internalpb.NormalizePrompt("Copying manifest"), src, dst, manifestDesc, repo, tag)
}, retrypolicy.DoOpts{
    FileSize: manifestDesc.Size,
    FileName: "manifest",
    Config:   &cfg.RetryConfig,
    OnRetry: func(attempt uint, reason string, backoff time.Duration) {
        prompt := fmt.Sprintf("%s (retry %d, %s, waiting %s)",
            internalpb.NormalizePrompt("Copying manifest"), attempt, reason, backoff.Truncate(time.Second))
        pb.Add(prompt, manifestDesc.Digest.String(), manifestDesc.Size, nil)
    },
}); err != nil {
    return fmt.Errorf("failed to push manifest to remote: %w", err)
}
```
OnRetry was added in a later commit; updated to use pb.Placeholder in 973f508.
Force-pushed from 9f96098 to 1369e47.
/gemini review

@codex-code-assist review
Code Review
This pull request introduces a significant improvement to the application's resilience by implementing a dynamic, file-size-based retry mechanism for data transfers. The new pkg/retrypolicy is well-designed and replaces the previous fixed retry logic. A key architectural change is the move away from cascading cancellations in errgroup, which will prevent a single failure from halting all concurrent operations. The addition of CLI flags for retry control is also a valuable feature. My review focuses on a critical race condition in the progress bar handling, a minor bug in the retry backoff calculation for logging, and opportunities to reduce code duplication for better maintainability.
```go
p.mu.RLock()
existing := p.bars[name]
p.mu.RUnlock()

// If the bar already exists, just reset its message.
if existing != nil {
    existing.msg = fmt.Sprintf("%s %s", prompt, name)
    existing.Bar.SetCurrent(0)
    return
}

// Create a new placeholder bar.
p.Add(prompt, name, size, nil)
```
There is a data race here. The existing.msg field is modified without a lock while it can be concurrently read by the progress bar's rendering goroutine, which can lead to unpredictable behavior or crashes.
To fix this, you should use a write lock to protect both the read from the p.bars map and the subsequent write to the msg field. The lock should be released before calling p.Add to avoid deadlocks, as p.Add acquires its own locks.
```go
p.mu.Lock()
if existing, ok := p.bars[name]; ok {
    // If the bar already exists, just reset its message.
    existing.msg = fmt.Sprintf("%s %s", prompt, name)
    existing.Bar.SetCurrent(0)
    p.mu.Unlock()
    return
}
p.mu.Unlock()

// Create a new placeholder bar.
p.Add(prompt, name, size, nil)
```
```go
var annoFilepath string
if layer.Annotations != nil {
    if layer.Annotations[modelspec.AnnotationFilepath] != "" {
        annoFilepath = layer.Annotations[modelspec.AnnotationFilepath]
    } else {
        annoFilepath = layer.Annotations[legacymodelspec.AnnotationFilepath]
    }
}
```
This logic to extract the filepath from annotations is duplicated in pkg/backend/fetch_by_d7y.go and pkg/backend/pull_by_d7y.go. To improve maintainability and reduce code duplication, consider extracting this into a shared helper function within the backend package.
For example, you could define this function:
```go
func getAnnotationFilepath(annotations map[string]string) string {
    if annotations == nil {
        return ""
    }
    if path := annotations[modelspec.AnnotationFilepath]; path != "" {
        return path
    }
    return annotations[legacymodelspec.AnnotationFilepath]
}
```

And then call it here:
```diff
-var annoFilepath string
-if layer.Annotations != nil {
-    if layer.Annotations[modelspec.AnnotationFilepath] != "" {
-        annoFilepath = layer.Annotations[modelspec.AnnotationFilepath]
-    } else {
-        annoFilepath = layer.Annotations[legacymodelspec.AnnotationFilepath]
-    }
-}
+annoFilepath := getAnnotationFilepath(layer.Annotations)
```
Extracted to backend.getAnnotationFilepath in 514f604; all six call sites now use the helper.
```go
var annoFilepath string
if desc.Annotations != nil {
    if desc.Annotations[modelspec.AnnotationFilepath] != "" {
        annoFilepath = desc.Annotations[modelspec.AnnotationFilepath]
    } else {
        annoFilepath = desc.Annotations[legacymodelspec.AnnotationFilepath]
    }
}
```
This logic to extract the filepath from annotations is duplicated in pkg/backend/fetch.go and pkg/backend/pull_by_d7y.go. To improve maintainability and reduce code duplication, consider extracting it into a shared helper function within the backend package (the same getAnnotationFilepath proposed above) and then call it here:
```diff
-var annoFilepath string
-if desc.Annotations != nil {
-    if desc.Annotations[modelspec.AnnotationFilepath] != "" {
-        annoFilepath = desc.Annotations[modelspec.AnnotationFilepath]
-    } else {
-        annoFilepath = desc.Annotations[legacymodelspec.AnnotationFilepath]
-    }
-}
+annoFilepath := getAnnotationFilepath(desc.Annotations)
```
Extracted to backend.getAnnotationFilepath in 514f604.
```go
var annoFilepath string
if desc.Annotations != nil {
    if desc.Annotations[modelspec.AnnotationFilepath] != "" {
        annoFilepath = desc.Annotations[modelspec.AnnotationFilepath]
    } else {
        annoFilepath = desc.Annotations[legacymodelspec.AnnotationFilepath]
    }
}
```
This logic to extract the filepath from annotations is duplicated in pkg/backend/fetch.go and pkg/backend/fetch_by_d7y.go. The same getAnnotationFilepath helper applies; call it here:
```go
annoFilepath := getAnnotationFilepath(desc.Annotations)
```
Extracted to backend.getAnnotationFilepath in 514f604.
```go
    return retryable
}),
retry.OnRetry(func(n uint, err error) {
    backoff := computeBackoff(n+1, initialDelay, maxBackoff)
```
The computeBackoff function is being called with n+1, but it seems to expect the 1-based retry attempt number, which is n. The retry-go library's OnRetry callback provides n as the 1-based retry attempt number. Passing n+1 causes the logged backoff duration to be one step ahead of the actual backoff being used by the retry library.
To ensure the logged backoff duration is accurate, you should call computeBackoff with n.
```diff
-backoff := computeBackoff(n+1, initialDelay, maxBackoff)
+backoff := computeBackoff(n, initialDelay, maxBackoff)
```
Fixed in f155ccb — computeBackoff now takes n directly.
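For reference, a minimal sketch of the agreed semantics (exponential doubling from the initial delay, capped at maxBackoff, with the 1-based attempt number retry-go supplies); the name and body are assumed, not the PR's code:

```go
package retrypolicy

import "time"

// computeBackoffSketch doubles the initial delay per retry attempt and
// caps the result at maxBackoff. attempt is 1-based, as supplied by
// retry-go's OnRetry callback.
func computeBackoffSketch(attempt uint, initial, maxBackoff time.Duration) time.Duration {
    d := initial << (attempt - 1) // attempt 1 -> initial, 2 -> 2x initial, ...
    if d > maxBackoff || d < initial {
        // The second clause guards against shift overflow for large attempts.
        return maxBackoff
    }
    return d
}
```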
```go
if attempt == 0 {
    return initial
}
```
Dropped in f155ccb; the test was updated to skip attempt=0 for the same reason.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb1fc1ebc2
```go
// Unknown errors default to retryable.
return true
```
Stop retrying unclassified errors as transient
This fallback makes every unexpected error retryable, and the new call sites now feed retrypolicy.Do with arbitrary local failures from paths like builder.BuildLayer and extractLayer. A permanent problem such as permission denied or no space left on device will now sit in the retry loop until the size-based deadline expires (10 minutes for small files, 60 minutes for blobs ≥10 GiB), so broken builds/pulls can appear hung for a long time instead of failing promptly.
Good point — the permanent-vs-transient split needs a deliberate allowlist (network timeouts, 5xx, connection resets, rate-limit responses, ...) rather than ad-hoc narrowing. Leaving for follow-up; I'll send a dedicated PR once we agree on the list so this one can land without scope creep.
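A sketch of the allowlist direction mentioned above (hypothetical, not the PR's code): only errors positively identified as transient are retried, and anything unclassified fails fast.

```go
package retrypolicy

import (
    "errors"
    "net"
    "syscall"
)

// isRetryableAllowlist retries only positively identified transient errors.
func isRetryableAllowlist(err error) bool {
    var netErr net.Error
    switch {
    case errors.As(err, &netErr) && netErr.Timeout():
        return true // network timeouts
    case errors.Is(err, syscall.ECONNRESET), errors.Is(err, syscall.ECONNREFUSED):
        return true // connection resets and transient connect failures
    }
    // HTTP 429/5xx would be matched here via a typed registry error or a
    // response-status check; anything unclassified is treated as permanent.
    return false
}
```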
```go
retryErr := retrypolicy.Do(ctx, func(retryCtx context.Context) error {
    logrus.Debugf("pull: processing layer %s", layer.Digest)
    // call the before hook.
    cfg.Hooks.BeforePullLayer(layer, manifest)
```
Thread retryCtx through layer pull operations
The per-attempt retryCtx created by retrypolicy.Do is not actually used for the layer transfer here: fn closes over the outer command context, so both pullIfNotExist and pullAndExtractFromRemote still run with ctx. In the normal pull path, any registry read/write that only aborts on context cancellation can therefore ignore --retry-max-time (or the dynamic size-based limit) and block the whole layer instead of timing out and retrying.
I think this analysis may have been tripped up by variable shadowing: fn declares its own ctx parameter, and at the call site we pass retryCtx — pullIfNotExist/pullAndExtractFromRemote run with retryCtx, not the outer command context. Let me know if I'm missing a path where that's not the case.
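To make the shadowing concrete, here is a runnable toy; doOneAttempt is a hypothetical stand-in for the per-attempt call inside retrypolicy.Do:

```go
package main

import (
    "context"
    "fmt"
)

type ctxKey string

// doOneAttempt derives a per-attempt context from parent and invokes fn
// with it, mirroring how retrypolicy.Do passes retryCtx to the closure.
func doOneAttempt(parent context.Context, fn func(ctx context.Context) error) error {
    retryCtx := context.WithValue(parent, ctxKey("scope"), "per-attempt")
    return fn(retryCtx)
}

func main() {
    ctx := context.WithValue(context.Background(), ctxKey("scope"), "command")
    _ = doOneAttempt(ctx, func(ctx context.Context) error { // parameter shadows the outer ctx
        fmt.Println(ctx.Value(ctxKey("scope"))) // prints "per-attempt", not "command"
        return nil
    })
}
```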
/gemini review

@codex-code-assist review
Code Review
This pull request introduces a robust and dynamic retry mechanism based on file size, significantly improving the reliability of data transfer operations. The refactoring to use a dedicated pkg/retrypolicy package centralizes retry logic, and the move to independent per-goroutine retries effectively addresses the cascading cancellation issue. The addition of CLI flags for retry configuration provides users with greater control. The comprehensive unit tests for the new retry policy are commendable, ensuring the correctness of the dynamic parameter calculations and error handling.
Overall, this is a well-designed and implemented feature that directly tackles a critical pain point for large file transfers.
```go
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
    prompt := fmt.Sprintf("%s (retry %d, %s, waiting %s)",
        internalpb.NormalizePrompt("Copying blob"), attempt, reason, backoff.Truncate(time.Second))
    pb.Add(prompt, layer.Digest.String(), layer.Size, nil)
},
```
For updating the progress bar during a retry, pb.Placeholder is more semantically appropriate than pb.Add. The Placeholder method is designed to update an existing bar's message and reset its current progress to zero, which is ideal for indicating a backoff period before the next attempt. This approach is already used consistently in pull.go and fetch.go.
Please apply this change to the OnRetry callbacks for layers, config, and manifest pushes for consistency and clearer UX.
For example, the OnRetry for layers could be updated as follows:
```diff
 OnRetry: func(attempt uint, reason string, backoff time.Duration) {
     prompt := fmt.Sprintf("%s (retry %d, %s, waiting %s)",
         internalpb.NormalizePrompt("Copying blob"), attempt, reason, backoff.Truncate(time.Second))
-    pb.Add(prompt, layer.Digest.String(), layer.Size, nil)
+    pb.Placeholder(layer.Digest.String(), prompt, layer.Size)
 },
```
Applied to layer, config, and manifest retries in 973f508.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f3fbdfe46a
```go
g.Wait()
if len(errs) > 0 {
    return fmt.Errorf("failed to push blob to remote: %w", errors.Join(errs...))
```
Propagate layer cancellation before pushing config
When ctx is cancelled after some layer uploads finish but before the remaining goroutines start (common when cfg.Concurrency < len(manifest.Layers)), those queued workers return ctx.Err() from the early select, but g.Wait() is ignored here. If the already-running uploads succeeded, errs stays empty and Push continues on to the config/manifest uploads, so the command can report success after skipping layers and may publish an incomplete artifact.
g.Wait() now captures its return into errs (973f508), so a cancelled batch can no longer slip through to the config/manifest push with an incomplete layer set.
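The resulting shape, sketched with the variable names from the quoted snippet (assumed; the commit diff itself is not shown here):

```go
// Join Wait()'s result into errs so a queued worker's ctx.Err() cannot be
// dropped when every upload that actually started happened to succeed.
if err := g.Wait(); err != nil {
    errs = append(errs, err)
}
if len(errs) > 0 {
    return fmt.Errorf("failed to push blob to remote: %w", errors.Join(errs...))
}
```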
```go
_ = g.Wait()
if len(errs) > 0 {
    return fmt.Errorf("failed to pull blob to local: %w", errors.Join(errs...))
```
Propagate layer cancellation before finalizing pull
If the command context is cancelled between layer batches, the queued workers exit via the top-of-goroutine ctx.Done() check, but this _ = g.Wait() drops that cancellation and errs remains empty. In extract-from-remote mode that means pull can return success after extracting only a subset of files; in normal mode it can also proceed to fetch config/manifest after skipping some layers, leaving an incomplete local artifact.
Same defensive capture applied to pull in 973f508.
```diff
 select {
 case <-ctx.Done():
-    return ctx.Err()
+    return nil
```
Return context cancellation from skipped build workers
Returning nil here means any worker that never starts because the build context was already cancelled is treated as a success. Combined with the ignored eg.Wait() result below, cancelling a build after some files finish but before the rest start lets Process return no error with only a partial descriptors set, so Build can emit an apparently successful artifact that is missing files from the worktree.
Fixed in 973f508: the select now returns ctx.Err() and eg.Wait()'s return is captured into errs, so a cancelled build can no longer return a partial descriptor set as a success.
@codex-code-assist review
Force-pushed from 6ec310c to 1f6c0f5.
…ameters Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
…ld/fetch Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
…ading cancellation Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
…artifacts Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
… tests Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
Force-pushed from 1f6c0f5 to fcdbcb0.
retry-go's OnRetry callback supplies a 1-based retry attempt number, so computeBackoff(n+1, ...) logged a delay that was one doubling ahead of the actual backoff used by the retry loop. Pass n directly. Also drop the unreachable attempt == 0 branch in computeBackoff (retry-go never supplies 0) and skip that case in the test. Addresses review feedback on PR #468 (gemini). Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
The 'prefer modelspec.AnnotationFilepath, fall back to the legacy dragonflyoss key' pattern was duplicated across fetch.go, fetch_by_d7y.go, and pull_by_d7y.go (six call sites total). Centralize it in backend.getAnnotationFilepath so each caller is a one-liner and future changes to the annotation resolution live in one place. Addresses review feedback on PR #468 (gemini). Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
Three call sites (push, pull, processor) used 'g.Wait()' without capturing its return, so a cancelled worker's ctx.Err() could be discarded. In the edge case where all running uploads/layers succeed but queued-but-unstarted workers exit via the ctx.Done() select, errs stayed empty and the operation continued to config/manifest push (or finalized a partial descriptor set), potentially publishing an incomplete artifact. Capture the Wait() result into errs and rely on the existing ctx.Err()/len(errs) checks to return cleanly. Also change processor/base.go's select from 'return nil' to 'return ctx.Err()' so the same propagation path applies there. Additionally, align push.go's OnRetry with pull.go/fetch.go by using pb.Placeholder (designed to reset an existing bar's message and progress) instead of pb.Add for layer, config, and manifest retries. Addresses review feedback on PR #468 (codex P1, gemini). Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
The previous design used a single MaxRetryTime budget to bound both
the in-flight transfer and the inter-attempt sleeps. With a wall-clock
that scales with file size, a slow first attempt can consume the whole
budget, leaving no room for retries — exactly when retries matter most.
Split into two independent knobs:
* PerAttemptTimeout: derived from file size assuming a 10 MiB/s
minimum throughput with a 2x safety factor, clamped to [5min, 8h].
Each attempt gets its own context.WithTimeout, cancelled per
attempt.
* MaxAttempts + MaxBackoff: bound retry-only behavior. Defaults are
constants (6 attempts, 2min backoff cap) and do not scale with
file size — transient-failure recovery time is payload-independent.
The retry loop classifies a per-attempt DeadlineExceeded under a live
parent context as retryable, so a single transfer timeout no longer
short-circuits the loop. User cancellation (parent ctx) still aborts
immediately.
CLI:
--retry-max-time removed (semantics conflated transfer and retry)
--retry-attempts new (int, total attempts including initial)
--per-attempt-timeout new (Duration; 0 = derive from size, <0 = off)
--no-retry unchanged
Tests cover the size→timeout mapping with clamps, the per-attempt
deadline retry loop, parent-cancel propagation, and a size-invariance
test that pins the design's core property: total retry wall-clock does
not depend on file size.
Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
@codex-code-assist review
No behavioral change. golangci-lint v2.5 (gci, golines with line length 122) flagged whitespace alignment in struct literals and long single-line table-driven test entries; auto-fixed via make lint-fix. Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fc3ac13156
```go
if cfg.NoRetry {
    return fn(ctx)
}
```
Honor per-attempt timeout when retries are disabled
Do returns early when NoRetry is true, so it calls fn(ctx) without applying PerAttemptTimeout. This means --no-retry silently disables the per-attempt deadline instead of only disabling extra attempts, and a single transfer can hang indefinitely unless the caller set its own context deadline. Users who combine --no-retry with --per-attempt-timeout will not get the timeout behavior they requested.
Good catch — fixed. The per-attempt deadline now applies on the NoRetry path too (single-attempt) since it's a transfer-bound concern, not a retry concern. Added TestDo_NoRetryHonorsPerAttemptTimeout to lock the invariant.
The previous Do() short-circuited NoRetry by calling fn(ctx) directly, bypassing the per-attempt deadline derivation. As Codex flagged, this made --no-retry silently disable the transfer timeout: a hung connection would never terminate, leaving users with a stalled CLI and no failure signal. Per-attempt timeout is a transfer-bound concern, not a retry concern, so it must apply on the single-attempt path too. Refactor the deadline setup into a small runAttempt closure used by both the NoRetry branch and retry.Do. Add TestDo_NoRetryHonorsPerAttemptTimeout to pin this invariant. Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
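A sketch of the refactor the commit describes; runAttempt is the commit's name for the closure, while the signature and body here are assumed:

```go
package retrypolicy

import (
    "context"
    "time"

    retry "github.com/avast/retry-go/v4"
)

// doSketch mirrors the described refactor (assumed shape, not the PR's
// code): the NoRetry branch and the retry loop share one per-attempt
// deadline setup via the runAttempt closure.
func doSketch(ctx context.Context, noRetry bool, perAttemptTimeout time.Duration, fn func(context.Context) error) error {
    runAttempt := func() error {
        attemptCtx, cancel := context.WithTimeout(ctx, perAttemptTimeout)
        defer cancel()
        return fn(attemptCtx)
    }
    if noRetry {
        return runAttempt() // single attempt, per-attempt deadline still enforced
    }
    return retry.Do(runAttempt, retry.Context(ctx))
}
```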
/gemini review
Code Review
This pull request introduces a specialized retrypolicy package to manage blob transfer retries, decoupling per-attempt timeouts from the retry backoff strategy. It adds CLI flags for retry configuration across the build, fetch, pull, and push commands and refactors the backend to use this new policy, improving error handling and progress tracking. A Placeholder method was added to the progress bar to handle resets during retries. Feedback suggests centralizing the progress bar reset logic within the Placeholder method and adopting the go-humanize library for byte formatting to eliminate redundant code.
```go
if existing != nil {
    existing.msg = fmt.Sprintf("%s %s", prompt, name)
    existing.Bar.SetCurrent(0)
    return
```
The Placeholder method should encapsulate the full reset logic used during retries, including setting the refill value and resetting the EWMA speed calculation. This ensures consistency across different transfer paths and avoids manual bar manipulation in the backend packages.
```diff
 if existing != nil {
     existing.msg = fmt.Sprintf("%s %s", prompt, name)
-    existing.Bar.SetCurrent(0)
+    existing.SetRefill(existing.Current())
+    existing.SetCurrent(0)
+    existing.EwmaSetCurrent(0, time.Second)
     return
 }
```
```go
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
    if bar := pb.Get(layer.Digest.String()); bar != nil {
        bar.SetRefill(bar.Current())
        bar.SetCurrent(0)
        bar.EwmaSetCurrent(0, time.Second)
    }
},
```
Use the pb.Placeholder method to reset the progress bar during retries. This avoids manual manipulation of the progress bar and ensures that the reset logic (including refill and EWMA reset) is applied consistently.
```go
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
    pb.Placeholder(layer.Digest.String(), internalpb.NormalizePrompt("Fetching blob"), layer.Size)
},
```

The same manual reset pattern appears at two further OnRetry call sites (keyed on desc.Digest) and was flagged there as well:

```go
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
    if bar := pb.Get(desc.Digest.String()); bar != nil {
        bar.SetRefill(bar.Current())
        bar.SetCurrent(0)
        bar.EwmaSetCurrent(0, time.Second)
    }
},
```
```go
import (
    "context"
    "errors"
    "fmt"
    "math"
    "regexp"
    "strings"
    "time"

    retry "github.com/avast/retry-go/v4"
    log "github.com/sirupsen/logrus"
)
```
Add github.com/dustin/go-humanize to the imports to use it for byte formatting, avoiding duplication of the humanizeBytes function.
```diff
 import (
     "context"
     "errors"
     "fmt"
     "math"
     "regexp"
     "strings"
     "time"

     retry "github.com/avast/retry-go/v4"
+    humanize "github.com/dustin/go-humanize"
     log "github.com/sirupsen/logrus"
 )
```
References
- To avoid code duplication, use functions from existing dependencies for common tasks (e.g., byte formatting) instead of reimplementing them across multiple packages.
```go
func humanizeBytes(b int64) string {
    const (
        kb = 1024
        mb = 1024 * kb
        gb = 1024 * mb
    )

    switch {
    case b >= gb:
        return fmt.Sprintf("%.1f GB", float64(b)/float64(gb))
    case b >= mb:
        return fmt.Sprintf("%.1f MB", float64(b)/float64(mb))
    case b >= kb:
        return fmt.Sprintf("%.1f KB", float64(b)/float64(kb))
    default:
        return fmt.Sprintf("%d B", b)
    }
}
```
To avoid code duplication and maintain consistency, use the existing github.com/dustin/go-humanize dependency for byte formatting instead of reimplementing humanizeBytes. This follows the general rule for common tasks.
```go
func humanizeBytes(b int64) string {
    return humanize.Bytes(uint64(b))
}
```

References
- To avoid code duplication, use functions from existing dependencies for common tasks (e.g., byte formatting) instead of reimplementing them across multiple packages.
The --no-retry flag was redundant once the retry budget was decoupled from the per-attempt timeout: it expressed nothing that --retry-attempts=1 cannot. Keeping both invited the class of bug Codex found earlier (NoRetry short-circuited Do() and bypassed the per-attempt deadline) — having two ways to say "don't retry" makes it easy to wire only one of them through correctly.

After this commit there are exactly two retry-related CLI flags:

--retry-attempts int (default 0 → 6; set to 1 for fail-fast)
--per-attempt-timeout duration (0 → derive from size; <0 → disabled)

Each controls one orthogonal concern. "No retry" is a value of one of them, not a separate flag.

BREAKING CHANGE: --no-retry is removed. Replace with --retry-attempts=1. The flag was introduced in this same PR and never shipped, so external breakage scope is just the PR's own iteration history.

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
Remove --retry-attempts and --per-attempt-timeout. Both controlled
operational behavior (retry tolerance, network speed assumption) that
practically never varies per invocation, so they didn't earn a slot
on the CLI. Keeping them invited misuse and grew help-text noise for
no measurable benefit.
After this commit, retry behavior is fully determined by package
defaults in pkg/retrypolicy:
- 6 attempts (DefaultMaxAttempts)
- 5s initial delay, exponential backoff capped at 2 min
- Per-attempt timeout derived from file size (10 MiB/s minimum
throughput, 2x safety, clamped to [5min, 8h])
The retrypolicy.Config struct stays public for programmatic users
embedding modctl in another binary; it just isn't wired to any CLI
flag. If a real user shows up with a network where the defaults break
down, we can add a knob then with their case as the documented
example — until then, YAGNI wins.
Cleanup:
- cmd/{push,pull,build,fetch}.go: drop flag registrations.
- pkg/config/{push,pull,build,fetch}.go: drop RetryConfig field
and the now-unused retrypolicy import.
- pkg/backend/{push,pull,fetch,build,*_by_d7y}.go: drop
Config: &cfg.RetryConfig from each retrypolicy.Do call.
- pkg/backend/processor/options.go: drop WithRetryConfig and the
retryConfig field; processor.Process no longer takes a retry
option.
- pkg/backend/processor/base.go: stop threading processOpts.retryConfig
into retrypolicy.DoOpts.
BREAKING CHANGE: --retry-attempts and --per-attempt-timeout are
removed. Both were introduced earlier in this same PR and never
shipped.
Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
Summary
- New pkg/retrypolicy package with decoupled per-attempt timeout and retry budget: the per-attempt timeout is derived from file size (clamped to [5 min, 8 h]), and each attempt gets its own deadline.
- Replaced cascading errgroup.WithContext cancellation with independent per-goroutine retry — one layer failure no longer cancels siblings.
- The retrypolicy.Config struct stays available for programmatic embedders.

Motivation
When pushing large model files (multi-GB to multi-TB) to OCI registries backed by rate-limited storage (e.g., Harbor + OSS), i/o timeout errors frequently occur. The previous retry mechanism had three problems: fixed retry parameters regardless of file size; cascading errgroup.WithContext cancellation, which meant one timeout killed all in-flight transfers; and a MaxRetryTime budget that covered both in-flight transfer time and inter-attempt sleeps. With a wall-clock that scales with file size, a slow first attempt could consume the whole budget, leaving no room for retries — exactly when retries matter most.

The current design splits the timing concerns into two independent constants, with no per-invocation overrides. We considered exposing them as CLI flags (--retry-attempts, --per-attempt-timeout) but found they were operational settings that rarely vary per invocation; YAGNI says skip until a real user case shows up.

Design
- ComputePerAttemptTimeout(fileSize) — file_size / 10 MiB/s × 2, clamped to [5 min, 8 h]
- DefaultMaxAttempts = 6
- DefaultMaxBackoff = 2 min
- DefaultInitialDelay = 5 s
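A sketch consistent with that mapping (assumed body; the shipped implementation may differ):

```go
func ComputePerAttemptTimeout(fileSize int64) time.Duration {
    const minThroughput = 10 * 1024 * 1024 // assumed floor: 10 MiB/s
    d := 2 * time.Duration(fileSize/minThroughput) * time.Second // 2x safety factor
    if d < 5*time.Minute {
        return 5 * time.Minute
    }
    if d > 8*time.Hour {
        return 8 * time.Hour
    }
    return d
}
```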
Do()viacontext.WithTimeout(ctx, perAttemptTimeout); the parentctxis reserved for user cancellation. ADeadlineExceedederror under a live parent context is reclassified as retryable, so a single transfer timeout no longer short-circuits the retry loop.Examples (
ComputePerAttemptTimeout):Changes
Changes

- pkg/retrypolicy/ — new package: Do(), IsRetryable(), ShortReason(), ComputePerAttemptTimeout()
- pkg/backend/push.go — retrypolicy.Do for config/manifest
- pkg/backend/pull.go — pb.Placeholder() for retry progress display
- pkg/backend/fetch.go, pkg/backend/build.go, pkg/backend/processor/base.go, pkg/backend/pull_by_d7y.go, fetch_by_d7y.go — migrated to retrypolicy.Do
- internal/pb/pb.go — new Placeholder() method for retry backoff display
- pkg/backend/retry.go — removed (the old defaultRetryOpts)
Test plan
- Unit tests in pkg/retrypolicy/retrypolicy_test.go, including a size-invariance test that pins the design's core property: total retry wall-clock does not depend on file size.
- go vet ./... clean.
- go test -race ./pkg/retrypolicy/... clean.
- make lint clean (golangci-lint v2.5.0).
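To illustrate what the size-invariance test pins down, a self-contained sketch; the helper and test names are hypothetical, and the real test lives in pkg/retrypolicy/retrypolicy_test.go:

```go
package retrypolicy_test

import (
    "testing"
    "time"
)

// totalRetrySleep sums the inter-attempt backoff under the PR's defaults
// (6 attempts, 5 s initial delay doubling to a 2 min cap). fileSize is
// deliberately unused: backoff is payload-independent by design.
func totalRetrySleep(fileSize int64) time.Duration {
    _ = fileSize
    var total time.Duration
    d := 5 * time.Second
    for i := 0; i < 5; i++ { // 6 attempts -> at most 5 sleeps
        total += d
        if d *= 2; d > 2*time.Minute {
            d = 2 * time.Minute
        }
    }
    return total
}

func TestRetryWallClockIsSizeInvariant(t *testing.T) {
    if small, large := totalRetrySleep(1<<20), totalRetrySleep(10<<40); small != large {
        t.Fatalf("retry wall-clock depends on file size: %v vs %v", small, large)
    }
}
```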