
feat: add dynamic timeout retry with file-size-based parameters#468

Draft
aftersnow wants to merge 16 commits into main from worktree-timeout-retry

Conversation

Contributor

@aftersnow commented Mar 23, 2026

Summary

  • Add pkg/retrypolicy package with decoupled per-attempt timeout and retry budget:
    • Per-attempt timeout derives from file size (assumed 10 MiB/s minimum throughput, 2x safety factor, clamped to [5 min, 8 h]). Each attempt gets its own deadline.
    • Total attempt count and per-sleep cap are constants and do not scale with file size.
  • Replace cascading errgroup.WithContext cancellation with independent per-goroutine retry — one layer failure no longer cancels siblings.
  • Apply retry consistently across all data transfer paths: push, pull, fetch, build, and Dragonfly variants.
  • No new CLI flags. Retry behavior is determined entirely by package defaults; the public retrypolicy.Config struct stays available for programmatic embedders.

Motivation

When pushing large model files (multi-GB to multi-TB) to OCI registries backed by rate-limited storage (e.g., Harbor + OSS), i/o timeout errors frequently occur. The previous retry mechanism had three problems:

  1. Fixed retry parameters didn't account for file size — large files need longer per-attempt timeouts.
  2. Cascading cancellation via errgroup.WithContext meant one timeout killed all in-flight transfers.
  3. Coupled timing semantics — an earlier iteration introduced a single MaxRetryTime budget covering both in-flight transfer time and inter-attempt sleeps. With a wall-clock budget that scales with file size, a slow first attempt could consume the whole budget, leaving no room for retries — exactly when retries matter most.

The current design splits the timing concerns into two independent constants, with no per-invocation overrides. We considered exposing them as CLI flags (--retry-attempts, --per-attempt-timeout) but found they were operational settings that rarely vary per invocation; YAGNI says skip until a real user case shows up.

Design

| Concern | Where it lives | Scales with file size? |
| --- | --- | --- |
| Per-attempt deadline | ComputePerAttemptTimeout(fileSize): file_size / 10 MiB/s × 2, clamped to [5 min, 8 h] | Yes |
| Total attempts (initial + retries) | DefaultMaxAttempts = 6 | No |
| Per-sleep cap | DefaultMaxBackoff = 2 min | No |
| Initial backoff | DefaultInitialDelay = 5 s | No |

Per-attempt deadlines are derived inside Do() via context.WithTimeout(ctx, perAttemptTimeout); the parent ctx is reserved for user cancellation. A DeadlineExceeded error under a live parent context is reclassified as retryable, so a single transfer timeout no longer short-circuits the retry loop.

Examples (ComputePerAttemptTimeout):

| File size | Per-attempt timeout |
| --- | --- |
| ≤ 1 GB | 5 min (floor) |
| 10 GB | ~34 min |
| 70 GB | ~4 h |
| ≥ 140 GB | 8 h (ceiling) |

Changes

| Area | What changed |
| --- | --- |
| pkg/retrypolicy/ | New package: Do(), IsRetryable(), ShortReason(), ComputePerAttemptTimeout() |
| pkg/backend/push.go | Independent retry per layer; retrypolicy.Do for config/manifest |
| pkg/backend/pull.go | Same pattern + pb.Placeholder() for retry progress display |
| pkg/backend/fetch.go | Added retry (previously had none) |
| pkg/backend/build.go | Config/manifest build wrapped in retry |
| pkg/backend/processor/base.go | Removed cascading cancellation; added per-file retry |
| pkg/backend/pull_by_d7y.go, fetch_by_d7y.go | Dragonfly paths use retry |
| internal/pb/pb.go | Placeholder() method for retry backoff display |
| pkg/backend/retry.go | Deleted (legacy defaultRetryOpts) |

No new CLI flags or config fields.

Test plan

  • Unit tests in pkg/retrypolicy/retrypolicy_test.go, including a size-invariance test that pins the design's core property: total retry wall-clock does not depend on file size.
  • go vet ./... clean.
  • go test -race ./pkg/retrypolicy/... clean.
  • make lint clean (golangci-lint v2.5.0).
  • E2E workflow passed on GitHub Actions.
  • Manual large-file push test against rate-limited registry.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of file transfer operations by introducing a dynamic and intelligent retry mechanism. It addresses previous limitations where fixed retry parameters and cascading cancellations led to frequent failures, especially with large files over unreliable networks. The new system adapts retry behavior based on file size and isolates failures, ensuring that individual transient errors do not disrupt the entire transfer process, thereby improving overall reliability and user experience.

Highlights

  • Dynamic Retry Policy: Implemented a new pkg/retrypolicy package that provides dynamic retry parameters (max retry time and max backoff) based on file size. Small files (<=1GB) use 10min/1min backoff, large files (>=10GB) use 60min/10min backoff, with linear interpolation for sizes in between.
  • Independent Per-Goroutine Retry: Replaced the cascading cancellation mechanism of errgroup.WithContext with independent retry logic for each concurrent operation (e.g., layer transfer). This ensures that a transient failure in one operation does not cancel all other in-flight transfers.
  • New CLI Flags: Added --no-retry to disable retries and --retry-max-time to set a custom maximum total retry duration per file (0 for dynamic) to the push, pull, build, and fetch commands.
  • Consistent Retry Application: Applied the new dynamic retry logic consistently across all data transfer paths, including push, pull, fetch, build, and their respective Dragonfly variants.
  • Progress Bar Integration: Introduced a Placeholder() method in the progress bar utility to display retry backoff messages, providing better user feedback during transient network issues.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a robust dynamic retry mechanism with file-size-based parameters, which is a significant improvement for handling transient network errors during large file transfers. The new pkg/retrypolicy is well-designed and thoroughly tested. A key architectural improvement is replacing cascading errgroup.WithContext with independent per-goroutine retries, preventing a single failure from terminating all concurrent operations. The addition of CLI flags for retry control and progress bar updates on retry significantly enhance usability. The changes are consistently applied across all data transfer paths. I have a couple of suggestions for improvement in pkg/backend/push.go to ensure a consistent user experience during retries.

Comment thread pkg/backend/push.go
Comment on lines +128 to 136
if err := retrypolicy.Do(ctx, func(rctx context.Context) error {
return pushIfNotExist(rctx, pb, internalpb.NormalizePrompt("Copying config"), src, dst, manifest.Config, repo, tag)
}, retrypolicy.DoOpts{
FileSize: manifest.Config.Size,
FileName: "config",
Config: &cfg.RetryConfig,
}); err != nil {
return fmt.Errorf("failed to push config to remote: %w", err)
}

medium

The retry logic for pushing the config object is missing an OnRetry handler. This is inconsistent with the retry logic for layers in this file and for all components in pkg/backend/pull.go. Adding an OnRetry handler will provide better visual feedback to the user during backoff periods, improving the user experience.

if err := retrypolicy.Do(ctx, func(rctx context.Context) error {
		return pushIfNotExist(rctx, pb, internalpb.NormalizePrompt("Copying config"), src, dst, manifest.Config, repo, tag)
	}, retrypolicy.DoOpts{
		FileSize: manifest.Config.Size,
		FileName: "config",
		Config:   &cfg.RetryConfig,
		OnRetry: func(attempt uint, reason string, backoff time.Duration) {
			prompt := fmt.Sprintf("%s (retry %d, %s, waiting %s)",
				internalpb.NormalizePrompt("Copying config"), attempt, reason, backoff.Truncate(time.Second))
			pb.Add(prompt, manifest.Config.Digest.String(), manifest.Config.Size, nil)
		},
	}); err != nil {
		return fmt.Errorf("failed to push config to remote: %w", err)
	}

Contributor Author


OnRetry was added in a later commit on this branch; updated to use pb.Placeholder in 973f508 for consistency with pull/fetch.

Comment thread pkg/backend/push.go
Comment on lines 138 to 152
// copy the manifest.
if err := retry.Do(func() error {
return pushIfNotExist(ctx, pb, internalpb.NormalizePrompt("Copying manifest"), src, dst, ocispec.Descriptor{
if err := retrypolicy.Do(ctx, func(rctx context.Context) error {
return pushIfNotExist(rctx, pb, internalpb.NormalizePrompt("Copying manifest"), src, dst, ocispec.Descriptor{
MediaType: manifest.MediaType,
Size: int64(len(manifestRaw)),
Digest: godigest.FromBytes(manifestRaw),
Data: manifestRaw,
}, repo, tag)
}, append(defaultRetryOpts, retry.Context(ctx))...); err != nil {
}, retrypolicy.DoOpts{
FileSize: int64(len(manifestRaw)),
FileName: "manifest",
Config: &cfg.RetryConfig,
}); err != nil {
return fmt.Errorf("failed to push manifest to remote: %w", err)
}

medium

Similar to the config push, the retry logic for pushing the manifest is missing an OnRetry handler. For consistency and better user experience, a handler should be added to update the progress bar during retries. I've also refactored the descriptor creation to make it accessible in the OnRetry callback.

// copy the manifest.
	manifestDesc := ocispec.Descriptor{
		MediaType: manifest.MediaType,
		Size:      int64(len(manifestRaw)),
		Digest:    godigest.FromBytes(manifestRaw),
		Data:      manifestRaw,
	}
	if err := retrypolicy.Do(ctx, func(rctx context.Context) error {
		return pushIfNotExist(rctx, pb, internalpb.NormalizePrompt("Copying manifest"), src, dst, manifestDesc, repo, tag)
	}, retrypolicy.DoOpts{
		FileSize: manifestDesc.Size,
		FileName: "manifest",
		Config:   &cfg.RetryConfig,
		OnRetry: func(attempt uint, reason string, backoff time.Duration) {
			prompt := fmt.Sprintf("%s (retry %d, %s, waiting %s)",
				internalpb.NormalizePrompt("Copying manifest"), attempt, reason, backoff.Truncate(time.Second))
			pb.Add(prompt, manifestDesc.Digest.String(), manifestDesc.Size, nil)
		},
	}); err != nil {
		return fmt.Errorf("failed to push manifest to remote: %w", err)
	}

Contributor Author


OnRetry was added in a later commit; updated to use pb.Placeholder in 973f508.

@aftersnow force-pushed the worktree-timeout-retry branch from 9f96098 to 1369e47 on March 23, 2026 12:55
@aftersnow
Contributor Author

/gemini review

@aftersnow
Contributor Author

@codex-code-assist review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a significant improvement to the application's resilience by implementing a dynamic, file-size-based retry mechanism for data transfers. The new pkg/retrypolicy is well-designed and replaces the previous fixed retry logic. A key architectural change is the move away from cascading cancellations in errgroup, which will prevent a single failure from halting all concurrent operations. The addition of CLI flags for retry control is also a valuable feature. My review focuses on a critical race condition in the progress bar handling, a minor bug in the retry backoff calculation for logging, and opportunities to reduce code duplication for better maintainability.

Comment thread internal/pb/pb.go
Comment on lines +146 to +158
p.mu.RLock()
existing := p.bars[name]
p.mu.RUnlock()

// If the bar already exists, just reset its message.
if existing != nil {
existing.msg = fmt.Sprintf("%s %s", prompt, name)
existing.Bar.SetCurrent(0)
return
}

// Create a new placeholder bar.
p.Add(prompt, name, size, nil)

critical

There is a data race condition here. The existing.msg field is being modified without a lock, while it can be concurrently read by the progress bar's rendering goroutine. This can lead to unpredictable behavior or crashes.

To fix this, you should use a write lock to protect both the read from the p.bars map and the subsequent write to the msg field. The lock should be released before calling p.Add to avoid deadlocks, as p.Add acquires its own locks.

Suggested change

Before:

    p.mu.RLock()
    existing := p.bars[name]
    p.mu.RUnlock()

    // If the bar already exists, just reset its message.
    if existing != nil {
        existing.msg = fmt.Sprintf("%s %s", prompt, name)
        existing.Bar.SetCurrent(0)
        return
    }

    // Create a new placeholder bar.
    p.Add(prompt, name, size, nil)

After:

    p.mu.Lock()
    if existing, ok := p.bars[name]; ok {
        // If the bar already exists, just reset its message.
        existing.msg = fmt.Sprintf("%s %s", prompt, name)
        existing.Bar.SetCurrent(0)
        p.mu.Unlock()
        return
    }
    p.mu.Unlock()

    // Create a new placeholder bar.
    p.Add(prompt, name, size, nil)

Contributor Author


Will be resolved after PR #474 merges — that PR converts bar.msg to atomic.Value, which removes this race. Rebasing here to add a different lock-based fix would conflict with #474's approach, so I'd rather wait and let the rebase pick it up.

Comment thread pkg/backend/fetch.go Outdated
Comment on lines 123 to 130
var annoFilepath string
if layer.Annotations != nil {
if layer.Annotations[modelspec.AnnotationFilepath] != "" {
annoFilepath = layer.Annotations[modelspec.AnnotationFilepath]
} else {
annoFilepath = layer.Annotations[legacymodelspec.AnnotationFilepath]
}
}

medium

This logic to extract the filepath from annotations is duplicated in pkg/backend/fetch_by_d7y.go and pkg/backend/pull_by_d7y.go. To improve maintainability and reduce code duplication, consider extracting this into a shared helper function within the backend package.

For example, you could define this function:

func getAnnotationFilepath(annotations map[string]string) string {
	if annotations == nil {
		return ""
	}
	if path := annotations[modelspec.AnnotationFilepath]; path != "" {
		return path
	}
	return annotations[legacymodelspec.AnnotationFilepath]
}

And then call it here.

Suggested change

Before:

    var annoFilepath string
    if layer.Annotations != nil {
        if layer.Annotations[modelspec.AnnotationFilepath] != "" {
            annoFilepath = layer.Annotations[modelspec.AnnotationFilepath]
        } else {
            annoFilepath = layer.Annotations[legacymodelspec.AnnotationFilepath]
        }
    }

After:

    annoFilepath := getAnnotationFilepath(layer.Annotations)

Contributor Author


Extracted to backend.getAnnotationFilepath in 514f604; all six call sites now use the helper.

Comment thread pkg/backend/fetch_by_d7y.go Outdated
Comment on lines +168 to +175
var annoFilepath string
if desc.Annotations != nil {
if desc.Annotations[modelspec.AnnotationFilepath] != "" {
annoFilepath = desc.Annotations[modelspec.AnnotationFilepath]
} else {
annoFilepath = desc.Annotations[legacymodelspec.AnnotationFilepath]
}
}

medium

This logic to extract the filepath from annotations is duplicated in pkg/backend/fetch.go and pkg/backend/pull_by_d7y.go. To improve maintainability and reduce code duplication, consider extracting this into a shared helper function within the backend package.

For example, you could define this function:

func getAnnotationFilepath(annotations map[string]string) string {
	if annotations == nil {
		return ""
	}
	if path := annotations[modelspec.AnnotationFilepath]; path != "" {
		return path
	}
	return annotations[legacymodelspec.AnnotationFilepath]
}

And then call it here.

Suggested change

Before:

    var annoFilepath string
    if desc.Annotations != nil {
        if desc.Annotations[modelspec.AnnotationFilepath] != "" {
            annoFilepath = desc.Annotations[modelspec.AnnotationFilepath]
        } else {
            annoFilepath = desc.Annotations[legacymodelspec.AnnotationFilepath]
        }
    }

After:

    annoFilepath := getAnnotationFilepath(desc.Annotations)

Contributor Author


Extracted to backend.getAnnotationFilepath in 514f604.

Comment thread pkg/backend/pull_by_d7y.go Outdated
Comment on lines +192 to +199
var annoFilepath string
if desc.Annotations != nil {
if desc.Annotations[modelspec.AnnotationFilepath] != "" {
annoFilepath = desc.Annotations[modelspec.AnnotationFilepath]
} else {
annoFilepath = desc.Annotations[legacymodelspec.AnnotationFilepath]
}
}

medium

This logic to extract the filepath from annotations is duplicated in pkg/backend/fetch.go and pkg/backend/fetch_by_d7y.go. To improve maintainability and reduce code duplication, consider extracting this into a shared helper function within the backend package.

For example, you could define this function:

func getAnnotationFilepath(annotations map[string]string) string {
	if annotations == nil {
		return ""
	}
	if path := annotations[modelspec.AnnotationFilepath]; path != "" {
		return path
	}
	return annotations[legacymodelspec.AnnotationFilepath]
}

And then call it here.

Suggested change

Before:

    var annoFilepath string
    if desc.Annotations != nil {
        if desc.Annotations[modelspec.AnnotationFilepath] != "" {
            annoFilepath = desc.Annotations[modelspec.AnnotationFilepath]
        } else {
            annoFilepath = desc.Annotations[legacymodelspec.AnnotationFilepath]
        }
    }

After:

    annoFilepath := getAnnotationFilepath(desc.Annotations)

Contributor Author


Extracted to backend.getAnnotationFilepath in 514f604.

Comment thread pkg/retrypolicy/retrypolicy.go Outdated
return retryable
}),
retry.OnRetry(func(n uint, err error) {
backoff := computeBackoff(n+1, initialDelay, maxBackoff)

medium

The computeBackoff function is being called with n+1, but it seems to expect the 1-based retry attempt number, which is n. The retry-go library's OnRetry callback provides n as the 1-based retry attempt number. Passing n+1 causes the logged backoff duration to be one step ahead of the actual backoff being used by the retry library.

To ensure the logged backoff duration is accurate, you should call computeBackoff with n.

Suggested change

Before:

    backoff := computeBackoff(n+1, initialDelay, maxBackoff)

After:

    backoff := computeBackoff(n, initialDelay, maxBackoff)

Contributor Author


Fixed in f155ccb — computeBackoff now takes n directly.

Comment on lines +164 to +166
if attempt == 0 {
return initial
}

medium

This check for attempt == 0 appears to be dead code. The retry-go library's OnRetry callback provides n as a 1-based retry attempt number, so n will always be 1 or greater. Since computeBackoff is called with n (or n+1), the attempt parameter will never be zero. This block can be safely removed.

Contributor Author


Dropped in f155ccb; the test was updated to skip attempt=0 for the same reason.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eb1fc1ebc2


Comment thread pkg/retrypolicy/retrypolicy.go Outdated
Comment on lines +222 to +223
// Unknown errors default to retryable.
return true

P1: Stop retrying unclassified errors as transient

This fallback makes every unexpected error retryable, and the new call sites now feed retrypolicy.Do with arbitrary local failures from paths like builder.BuildLayer and extractLayer. A permanent problem such as permission denied or no space left on device will now sit in the retry loop until the size-based deadline expires (10 minutes for small files, 60 minutes for blobs ≥10 GiB), so broken builds/pulls can appear hung for a long time instead of failing promptly.


Contributor Author


Good point — the permanent-vs-transient split needs a deliberate allowlist (network timeouts, 5xx, connection resets, rate-limit responses, ...) rather than ad-hoc narrowing. Leaving for follow-up; I'll send a dedicated PR once we agree on the list so this one can land without scope creep.

Comment thread pkg/backend/pull.go
Comment on lines +121 to 124
retryErr := retrypolicy.Do(ctx, func(retryCtx context.Context) error {
logrus.Debugf("pull: processing layer %s", layer.Digest)
// call the before hook.
cfg.Hooks.BeforePullLayer(layer, manifest)

P2: Thread retryCtx through layer pull operations

The per-attempt retryCtx created by retrypolicy.Do is not actually used for the layer transfer here: fn closes over the outer command context, so both pullIfNotExist and pullAndExtractFromRemote still run with ctx. In the normal pull path, any registry read/write that only aborts on context cancellation can therefore ignore --retry-max-time (or the dynamic size-based limit) and block the whole layer instead of timing out and retrying.


Contributor Author


I think this analysis may have been tripped up by variable shadowing: fn declares its own ctx parameter, and at the call site we pass retryCtx, so pullIfNotExist/pullAndExtractFromRemote run with retryCtx, not the outer command context. Let me know if I'm missing a path where that's not the case.

@aftersnow
Contributor Author

/gemini review

@aftersnow
Contributor Author

@codex-code-assist review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a robust and dynamic retry mechanism based on file size, significantly improving the reliability of data transfer operations. The refactoring to use a dedicated pkg/retrypolicy package centralizes retry logic, and the move to independent per-goroutine retries effectively addresses the cascading cancellation issue. The addition of CLI flags for retry configuration provides users with greater control. The comprehensive unit tests for the new retry policy are commendable, ensuring the correctness of the dynamic parameter calculations and error handling.

Overall, this is a well-designed and implemented feature that directly tackles a critical pain point for large file transfers.

Comment thread pkg/backend/push.go
Comment on lines +108 to +112
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
prompt := fmt.Sprintf("%s (retry %d, %s, waiting %s)",
internalpb.NormalizePrompt("Copying blob"), attempt, reason, backoff.Truncate(time.Second))
pb.Add(prompt, layer.Digest.String(), layer.Size, nil)
},

medium

For updating the progress bar during a retry, pb.Placeholder is more semantically appropriate than pb.Add. The Placeholder method is designed to update an existing bar's message and reset its current progress to zero, which is ideal for indicating a backoff period before the next attempt. This approach is already used consistently in pull.go and fetch.go.

Please apply this change to the OnRetry callbacks for layers, config, and manifest pushes for consistency and clearer UX.

For example, the OnRetry for layers could be updated as follows:

Suggested change

Before:

    OnRetry: func(attempt uint, reason string, backoff time.Duration) {
        prompt := fmt.Sprintf("%s (retry %d, %s, waiting %s)",
            internalpb.NormalizePrompt("Copying blob"), attempt, reason, backoff.Truncate(time.Second))
        pb.Add(prompt, layer.Digest.String(), layer.Size, nil)
    },

After:

    OnRetry: func(attempt uint, reason string, backoff time.Duration) {
        prompt := fmt.Sprintf("%s (retry %d, %s, waiting %s)",
            internalpb.NormalizePrompt("Copying blob"), attempt, reason, backoff.Truncate(time.Second))
        pb.Placeholder(layer.Digest.String(), prompt, layer.Size)
    },

Contributor Author


Applied to layer, config, and manifest retries in 973f508.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3fbdfe46a


Comment thread pkg/backend/push.go Outdated
Comment on lines +122 to +124
g.Wait()
if len(errs) > 0 {
return fmt.Errorf("failed to push blob to remote: %w", errors.Join(errs...))

P1: Propagate layer cancellation before pushing config

When ctx is cancelled after some layer uploads finish but before the remaining goroutines start (common when cfg.Concurrency < len(manifest.Layers)), those queued workers return ctx.Err() from the early select, but g.Wait() is ignored here. If the already-running uploads succeeded, errs stays empty and Push continues on to the config/manifest uploads, so the command can report success after skipping layers and may publish an incomplete artifact.


Contributor Author


g.Wait() now captures its return into errs (973f508), so a cancelled batch can no longer slip through to the config/manifest push with an incomplete layer set.

Comment thread pkg/backend/pull.go Outdated
Comment on lines +152 to +154
_ = g.Wait()
if len(errs) > 0 {
return fmt.Errorf("failed to pull blob to local: %w", errors.Join(errs...))

P1: Propagate layer cancellation before finalizing pull

If the command context is cancelled between layer batches, the queued workers exit via the top-of-goroutine ctx.Done() check, but this _ = g.Wait() drops that cancellation and errs remains empty. In extract-from-remote mode that means pull can return success after extracting only a subset of files; in normal mode it can also proceed to fetch config/manifest after skipping some layers, leaving an incomplete local artifact.


Contributor Author


Same defensive capture applied to pull in 973f508.

Comment thread pkg/backend/processor/base.go Outdated
Comment on lines +136 to +138
```diff
 select {
 case <-ctx.Done():
-	return nil
+	return ctx.Err()
```

P1: Return context cancellation from skipped build workers

Returning nil here means any worker that never starts because the build context was already cancelled is treated as a success. Combined with the ignored eg.Wait() result below, cancelling a build after some files finish but before the rest start lets Process return no error with only a partial descriptors set, so Build can emit an apparently successful artifact that is missing files from the worktree.

Contributor Author


Fixed in 973f508: the select now returns ctx.Err() and eg.Wait()'s return is captured into errs, so a cancelled build can no longer return a partial descriptor set as a success.

@aftersnow
Contributor Author

@codex-code-assist review

@aftersnow aftersnow force-pushed the worktree-timeout-retry branch from 6ec310c to 1f6c0f5 on March 24, 2026 06:28
…ameters

Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
…ld/fetch

Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
…ading cancellation

Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
…artifacts

Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
… tests

Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
@aftersnow aftersnow force-pushed the worktree-timeout-retry branch from 1f6c0f5 to fcdbcb0 on March 24, 2026 06:34
retry-go's OnRetry callback supplies a 1-based retry attempt number,
so computeBackoff(n+1, ...) logged a delay that was one doubling
ahead of the actual backoff used by the retry loop. Pass n directly.

Also drop the unreachable attempt == 0 branch in computeBackoff
(retry-go never supplies 0) and skip that case in the test.

Addresses review feedback on PR #468 (gemini).

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
The 'prefer modelspec.AnnotationFilepath, fall back to the legacy
dragonflyoss key' pattern was duplicated across fetch.go,
fetch_by_d7y.go, and pull_by_d7y.go (six call sites total). Centralize
it in backend.getAnnotationFilepath so each caller is a one-liner and
future changes to the annotation resolution live in one place.

Addresses review feedback on PR #468 (gemini).

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
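The centralized helper can be sketched as follows. The concrete annotation key strings here are placeholders; the real constants come from the model-spec package and the legacy dragonflyoss namespace:

```go
package main

import "fmt"

// Placeholder key names for illustration only; the real code uses
// modelspec.AnnotationFilepath and the legacy dragonflyoss constant.
const (
	annotationFilepath       = "org.cnai.model.filepath"
	legacyAnnotationFilepath = "org.dragonflyoss.model.filepath"
)

// getAnnotationFilepath prefers the model-spec key and falls back to
// the legacy key, so each of the six former call sites becomes a
// one-liner and the resolution order lives in one place.
func getAnnotationFilepath(annotations map[string]string) string {
	if v, ok := annotations[annotationFilepath]; ok && v != "" {
		return v
	}
	return annotations[legacyAnnotationFilepath]
}

func main() {
	// Legacy-only manifests still resolve.
	fmt.Println(getAnnotationFilepath(map[string]string{
		legacyAnnotationFilepath: "weights/model.bin",
	})) // → weights/model.bin
}
```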
Three call sites (push, pull, processor) used 'g.Wait()' without
capturing its return, so a cancelled worker's ctx.Err() could be
discarded. In the edge case where all running uploads/layers succeed
but queued-but-unstarted workers exit via the ctx.Done() select,
errs stayed empty and the operation continued to config/manifest push
(or finalized a partial descriptor set), potentially publishing an
incomplete artifact. Capture the Wait() result into errs and rely on
the existing ctx.Err()/len(errs) checks to return cleanly. Also change
processor/base.go's select from 'return nil' to 'return ctx.Err()' so
the same propagation path applies there.

Additionally, align push.go's OnRetry with pull.go/fetch.go by using
pb.Placeholder (designed to reset an existing bar's message and
progress) instead of pb.Add for layer, config, and manifest retries.

Addresses review feedback on PR #468 (codex P1, gemini).

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
@aftersnow aftersnow marked this pull request as draft on April 23, 2026 05:58
The previous design used a single MaxRetryTime budget to bound both
the in-flight transfer and the inter-attempt sleeps. With a wall-clock
that scales with file size, a slow first attempt can consume the whole
budget, leaving no room for retries — exactly when retries matter most.

Split into two independent knobs:

  * PerAttemptTimeout: derived from file size assuming a 10 MiB/s
    minimum throughput with a 2x safety factor, clamped to [5min, 8h].
    Each attempt gets its own context.WithTimeout, cancelled per
    attempt.

  * MaxAttempts + MaxBackoff: bound retry-only behavior. Defaults are
    constants (6 attempts, 2min backoff cap) and do not scale with
    file size — transient-failure recovery time is payload-independent.

The retry loop classifies a per-attempt DeadlineExceeded under a live
parent context as retryable, so a single transfer timeout no longer
short-circuits the loop. User cancellation (parent ctx) still aborts
immediately.

CLI:

  --retry-max-time       removed (semantics conflated transfer and retry)
  --retry-attempts       new (int, total attempts including initial)
  --per-attempt-timeout  new (Duration; 0 = derive from size, <0 = off)
  --no-retry             unchanged

Tests cover the size→timeout mapping with clamps, the per-attempt
deadline retry loop, parent-cancel propagation, and a size-invariance
test that pins the design's core property: total retry wall-clock does
not depend on file size.

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
@aftersnow
Contributor Author

@codex-code-assist review

No behavioral change. golangci-lint v2.5 (gci, golines with line
length 122) flagged whitespace alignment in struct literals and
long single-line table-driven test entries; auto-fixed via
make lint-fix.

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc3ac13156


Comment thread pkg/retrypolicy/retrypolicy.go Outdated
Comment on lines +148 to +150
```go
if cfg.NoRetry {
	return fn(ctx)
}
```

P1: Honor per-attempt timeout when retries are disabled

Do returns early when NoRetry is true, so it calls fn(ctx) without applying PerAttemptTimeout. This means --no-retry silently disables the per-attempt deadline instead of only disabling extra attempts, and a single transfer can hang indefinitely unless the caller set its own context deadline. Users who combine --no-retry with --per-attempt-timeout will not get the timeout behavior they requested.

Contributor Author


Good catch — fixed. The per-attempt deadline now applies on the NoRetry path too (single-attempt) since it's a transfer-bound concern, not a retry concern. Added TestDo_NoRetryHonorsPerAttemptTimeout to lock the invariant.

The previous Do() short-circuited NoRetry by calling fn(ctx) directly,
bypassing the per-attempt deadline derivation. As Codex flagged, this
made --no-retry silently disable the transfer timeout: a hung
connection would never terminate, leaving users with a stalled CLI
and no failure signal.

Per-attempt timeout is a transfer-bound concern, not a retry concern,
so it must apply on the single-attempt path too. Refactor the deadline
setup into a small runAttempt closure used by both the NoRetry branch
and retry.Do.

Add TestDo_NoRetryHonorsPerAttemptTimeout to pin this invariant.

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
@aftersnow
Contributor Author

/gemini review


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a specialized retrypolicy package to manage blob transfer retries, decoupling per-attempt timeouts from the retry backoff strategy. It adds CLI flags for retry configuration across the build, fetch, pull, and push commands and refactors the backend to use this new policy, improving error handling and progress tracking. A Placeholder method was added to the progress bar to handle resets during retries. Feedback suggests centralizing the progress bar reset logic within the Placeholder method and adopting the go-humanize library for byte formatting to eliminate redundant code.

Comment thread internal/pb/pb.go
Comment on lines +151 to +154
```go
if existing != nil {
	existing.msg = fmt.Sprintf("%s %s", prompt, name)
	existing.Bar.SetCurrent(0)
	return
```

medium

The Placeholder method should encapsulate the full reset logic used during retries, including setting the refill value and resetting the EWMA speed calculation. This ensures consistency across different transfer paths and avoids manual bar manipulation in the backend packages.

Suggested change

```go
if existing != nil {
	existing.msg = fmt.Sprintf("%s %s", prompt, name)
	existing.Bar.SetCurrent(0)
	return
```

```go
if existing != nil {
	existing.msg = fmt.Sprintf("%s %s", prompt, name)
	existing.SetRefill(existing.Current())
	existing.SetCurrent(0)
	existing.EwmaSetCurrent(0, time.Second)
	return
}
```

Comment thread pkg/backend/fetch.go
Comment on lines +127 to +133
```go
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
	if bar := pb.Get(layer.Digest.String()); bar != nil {
		bar.SetRefill(bar.Current())
		bar.SetCurrent(0)
		bar.EwmaSetCurrent(0, time.Second)
	}
},
```

medium

Use the pb.Placeholder method to reset the progress bar during retries. This avoids manual manipulation of the progress bar and ensures that the reset logic (including refill and EWMA reset) is applied consistently.

```go
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
	pb.Placeholder(layer.Digest.String(), internalpb.NormalizePrompt("Fetching blob"), layer.Size)
},
```

Comment on lines +183 to +189
```go
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
	if bar := pb.Get(desc.Digest.String()); bar != nil {
		bar.SetRefill(bar.Current())
		bar.SetCurrent(0)
		bar.EwmaSetCurrent(0, time.Second)
	}
},
```

medium

Use the pb.Placeholder method to reset the progress bar during retries for consistency with other transfer paths.

```go
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
	pb.Placeholder(desc.Digest.String(), internalpb.NormalizePrompt("Fetching blob"), desc.Size)
},
```

Comment on lines +210 to +216
```go
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
	if bar := pb.Get(desc.Digest.String()); bar != nil {
		bar.SetRefill(bar.Current())
		bar.SetCurrent(0)
		bar.EwmaSetCurrent(0, time.Second)
	}
},
```

medium

Use the pb.Placeholder method to reset the progress bar during retries for consistency with other transfer paths.

```go
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
	pb.Placeholder(desc.Digest.String(), internalpb.NormalizePrompt("Pulling blob"), desc.Size)
},
```

Comment on lines +35 to +46
```go
import (
	"context"
	"errors"
	"fmt"
	"math"
	"regexp"
	"strings"
	"time"

	retry "github.com/avast/retry-go/v4"
	log "github.com/sirupsen/logrus"
)
```

medium

Add github.com/dustin/go-humanize to the imports to use it for byte formatting, avoiding duplication of the humanizeBytes function.

Suggested change

```go
import (
	"context"
	"errors"
	"fmt"
	"math"
	"regexp"
	"strings"
	"time"

	retry "github.com/avast/retry-go/v4"
	log "github.com/sirupsen/logrus"
)
```

```go
import (
	"context"
	"errors"
	"fmt"
	"math"
	"regexp"
	"strings"
	"time"

	retry "github.com/avast/retry-go/v4"
	humanize "github.com/dustin/go-humanize"
	log "github.com/sirupsen/logrus"
)
```
References
  1. To avoid code duplication, use functions from existing dependencies for common tasks (e.g., byte formatting) instead of reimplementing them across multiple packages.

Comment on lines +378 to +395
```go
func humanizeBytes(b int64) string {
	const (
		kb = 1024
		mb = 1024 * kb
		gb = 1024 * mb
	)

	switch {
	case b >= gb:
		return fmt.Sprintf("%.1f GB", float64(b)/float64(gb))
	case b >= mb:
		return fmt.Sprintf("%.1f MB", float64(b)/float64(mb))
	case b >= kb:
		return fmt.Sprintf("%.1f KB", float64(b)/float64(kb))
	default:
		return fmt.Sprintf("%d B", b)
	}
}
```

medium

To avoid code duplication and maintain consistency, use the existing github.com/dustin/go-humanize dependency for byte formatting instead of reimplementing humanizeBytes. This follows the general rule for common tasks.

```go
func humanizeBytes(b int64) string {
	return humanize.Bytes(uint64(b))
}
```
References
  1. To avoid code duplication, use functions from existing dependencies for common tasks (e.g., byte formatting) instead of reimplementing them across multiple packages.

aftersnow added 2 commits May 6, 2026 20:58
The --no-retry flag was redundant once the retry budget was decoupled
from the per-attempt timeout: it expressed nothing that
--retry-attempts=1 cannot. Keeping both invited the class of bug Codex
found earlier (NoRetry short-circuited Do() and bypassed the
per-attempt deadline) — having two ways to say "don't retry" makes it
easy to wire only one of them through correctly.

After this commit there are exactly two retry-related CLI flags:

  --retry-attempts        int (default 0 → 6; set to 1 for fail-fast)
  --per-attempt-timeout   duration (0 → derive from size; <0 → disabled)

Each controls one orthogonal concern. "No retry" is a value of one of
them, not a separate flag.

BREAKING CHANGE: --no-retry is removed. Replace with --retry-attempts=1.
The flag was introduced in this same PR and never shipped, so external
breakage scope is just the PR's own iteration history.

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
Remove --retry-attempts and --per-attempt-timeout. Both controlled
operational behavior (retry tolerance, network speed assumption) that
practically never varies per invocation, so they didn't earn a slot
on the CLI. Keeping them invited misuse and grew help-text noise for
no measurable benefit.

After this commit, retry behavior is fully determined by package
defaults in pkg/retrypolicy:

  - 6 attempts (DefaultMaxAttempts)
  - 5s initial delay, exponential backoff capped at 2 min
  - Per-attempt timeout derived from file size (10 MiB/s minimum
    throughput, 2x safety, clamped to [5min, 8h])

The retrypolicy.Config struct stays public for programmatic users
embedding modctl in another binary; it just isn't wired to any CLI
flag. If a real user shows up with a network where the defaults break
down, we can add a knob then with their case as the documented
example — until then, YAGNI wins.

Cleanup:
  - cmd/{push,pull,build,fetch}.go: drop flag registrations.
  - pkg/config/{push,pull,build,fetch}.go: drop RetryConfig field
    and the now-unused retrypolicy import.
  - pkg/backend/{push,pull,fetch,build,*_by_d7y}.go: drop
    Config: &cfg.RetryConfig from each retrypolicy.Do call.
  - pkg/backend/processor/options.go: drop WithRetryConfig and the
    retryConfig field; processor.Process no longer takes a retry
    option.
  - pkg/backend/processor/base.go: stop threading processOpts.retryConfig
    into retrypolicy.DoOpts.

BREAKING CHANGE: --retry-attempts and --per-attempt-timeout are
removed. Both were introduced earlier in this same PR and never
shipped.

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>