
feat: add dynamic timeout retry with file-size-based parameters#468

Draft
aftersnow wants to merge 16 commits into main from worktree-timeout-retry

Conversation

Contributor

@aftersnow commented Mar 23, 2026

Summary

  • Add pkg/retrypolicy package with decoupled per-attempt timeout and retry budget:
    • Per-attempt timeout derives from file size (assumed 10 MiB/s minimum throughput, 2x safety factor, clamped to [5 min, 8 h]). Each attempt gets its own deadline.
    • Total attempt count and per-sleep cap are constants and do not scale with file size.
  • Replace cascading errgroup.WithContext cancellation with independent per-goroutine retry — one layer failure no longer cancels siblings.
  • Apply retry consistently across all data transfer paths: push, pull, fetch, build, and Dragonfly variants.
  • No new CLI flags. Retry behavior is determined entirely by package defaults; the public retrypolicy.Config struct stays available for programmatic embedders.

Motivation

When pushing large model files (multi-GB to multi-TB) to OCI registries backed by rate-limited storage (e.g., Harbor + OSS), i/o timeout errors frequently occur. The previous retry mechanism had three problems:

  1. Fixed retry parameters didn't account for file size — large files need longer per-attempt timeouts.
  2. Cascading cancellation via errgroup.WithContext meant one timeout killed all in-flight transfers.
  3. Coupled timing semantics — an earlier iteration introduced a single MaxRetryTime budget covering both in-flight transfer time and inter-attempt sleeps. With a wall-clock budget that scales with file size, a slow first attempt could consume the whole budget, leaving no room for retries — exactly when retries matter most.

The current design splits the timing concerns into two independent constants, with no per-invocation overrides. We considered exposing them as CLI flags (--retry-attempts, --per-attempt-timeout) but found they were operational settings that rarely vary per invocation; YAGNI says skip until a real user case shows up.

Design

| Concern | Where it lives | Scales with file size? |
| --- | --- | --- |
| Per-attempt deadline | ComputePerAttemptTimeout(fileSize): file_size / 10 MiB/s × 2, clamped to [5 min, 8 h] | Yes |
| Total attempts (initial + retries) | DefaultMaxAttempts = 6 | No |
| Per-sleep cap | DefaultMaxBackoff = 2 min | No |
| Initial backoff | DefaultInitialDelay = 5 s | No |

Per-attempt deadlines are derived inside Do() via context.WithTimeout(ctx, perAttemptTimeout); the parent ctx is reserved for user cancellation. A DeadlineExceeded error under a live parent context is reclassified as retryable, so a single transfer timeout no longer short-circuits the retry loop.

Examples (ComputePerAttemptTimeout):

| File size | Per-attempt timeout |
| --- | --- |
| ≤ 1 GB | 5 min (floor) |
| 10 GB | ~34 min |
| 70 GB | ~4 h |
| ≥ 140 GB | 8 h (ceiling) |

Changes

| Area | What changed |
| --- | --- |
| pkg/retrypolicy/ | New package: Do(), IsRetryable(), ShortReason(), ComputePerAttemptTimeout() |
| pkg/backend/push.go | Independent retry per layer; retrypolicy.Do for config/manifest |
| pkg/backend/pull.go | Same pattern + pb.Placeholder() for retry progress display |
| pkg/backend/fetch.go | Added retry (previously had none) |
| pkg/backend/build.go | Config/manifest build wrapped in retry |
| pkg/backend/processor/base.go | Removed cascading cancellation; added per-file retry |
| pkg/backend/pull_by_d7y.go, fetch_by_d7y.go | Dragonfly paths use retry |
| internal/pb/pb.go | Placeholder() method for retry backoff display |
| pkg/backend/retry.go | Deleted (legacy defaultRetryOpts) |

No new CLI flags or config fields.

Test plan

  • Unit tests in pkg/retrypolicy/retrypolicy_test.go, including a size-invariance test that pins the design's core property: total retry wall-clock does not depend on file size.
  • go vet ./... clean.
  • go test -race ./pkg/retrypolicy/... clean.
  • make lint clean (golangci-lint v2.5.0).
  • E2E workflow passed on GitHub Actions.
  • Manual large-file push test against rate-limited registry.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of file transfer operations by introducing a dynamic and intelligent retry mechanism. It addresses previous limitations where fixed retry parameters and cascading cancellations led to frequent failures, especially with large files over unreliable networks. The new system adapts retry behavior based on file size and isolates failures, ensuring that individual transient errors do not disrupt the entire transfer process, thereby improving overall reliability and user experience.

Highlights

  • Dynamic Retry Policy: Implemented a new pkg/retrypolicy package that provides dynamic retry parameters (max retry time and max backoff) based on file size. Small files (<=1GB) use 10min/1min backoff, large files (>=10GB) use 60min/10min backoff, with linear interpolation for sizes in between.
  • Independent Per-Goroutine Retry: Replaced the cascading cancellation mechanism of errgroup.WithContext with independent retry logic for each concurrent operation (e.g., layer transfer). This ensures that a transient failure in one operation does not cancel all other in-flight transfers.
  • New CLI Flags: Added --no-retry to disable retries and --retry-max-time to set a custom maximum total retry duration per file (0 for dynamic) to the push, pull, build, and fetch commands.
  • Consistent Retry Application: Applied the new dynamic retry logic consistently across all data transfer paths, including push, pull, fetch, build, and their respective Dragonfly variants.
  • Progress Bar Integration: Introduced a Placeholder() method in the progress bar utility to display retry backoff messages, providing better user feedback during transient network issues.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a robust dynamic retry mechanism with file-size-based parameters, which is a significant improvement for handling transient network errors during large file transfers. The new pkg/retrypolicy is well-designed and thoroughly tested. A key architectural improvement is replacing cascading errgroup.WithContext with independent per-goroutine retries, preventing a single failure from terminating all concurrent operations. The addition of CLI flags for retry control and progress bar updates on retry significantly enhance usability. The changes are consistently applied across all data transfer paths. I have a couple of suggestions for improvement in pkg/backend/push.go to ensure a consistent user experience during retries.

Comment thread pkg/backend/push.go
Comment on lines +128 to 136
if err := retrypolicy.Do(ctx, func(rctx context.Context) error {
return pushIfNotExist(rctx, pb, internalpb.NormalizePrompt("Copying config"), src, dst, manifest.Config, repo, tag)
}, retrypolicy.DoOpts{
FileSize: manifest.Config.Size,
FileName: "config",
Config: &cfg.RetryConfig,
}); err != nil {
return fmt.Errorf("failed to push config to remote: %w", err)
}

medium

The retry logic for pushing the config object is missing an OnRetry handler. This is inconsistent with the retry logic for layers in this file and for all components in pkg/backend/pull.go. Adding an OnRetry handler will provide better visual feedback to the user during backoff periods, improving the user experience.

if err := retrypolicy.Do(ctx, func(rctx context.Context) error {
		return pushIfNotExist(rctx, pb, internalpb.NormalizePrompt("Copying config"), src, dst, manifest.Config, repo, tag)
	}, retrypolicy.DoOpts{
		FileSize: manifest.Config.Size,
		FileName: "config",
		Config:   &cfg.RetryConfig,
		OnRetry: func(attempt uint, reason string, backoff time.Duration) {
			prompt := fmt.Sprintf("%s (retry %d, %s, waiting %s)",
				internalpb.NormalizePrompt("Copying config"), attempt, reason, backoff.Truncate(time.Second))
			pb.Add(prompt, manifest.Config.Digest.String(), manifest.Config.Size, nil)
		},
	}); err != nil {
		return fmt.Errorf("failed to push config to remote: %w", err)
	}

Contributor Author


OnRetry was added in a later commit on this branch; updated to use pb.Placeholder in 973f508 for consistency with pull/fetch.

Comment thread pkg/backend/push.go
Comment on lines 138 to 152
// copy the manifest.
if err := retry.Do(func() error {
return pushIfNotExist(ctx, pb, internalpb.NormalizePrompt("Copying manifest"), src, dst, ocispec.Descriptor{
if err := retrypolicy.Do(ctx, func(rctx context.Context) error {
return pushIfNotExist(rctx, pb, internalpb.NormalizePrompt("Copying manifest"), src, dst, ocispec.Descriptor{
MediaType: manifest.MediaType,
Size: int64(len(manifestRaw)),
Digest: godigest.FromBytes(manifestRaw),
Data: manifestRaw,
}, repo, tag)
}, append(defaultRetryOpts, retry.Context(ctx))...); err != nil {
}, retrypolicy.DoOpts{
FileSize: int64(len(manifestRaw)),
FileName: "manifest",
Config: &cfg.RetryConfig,
}); err != nil {
return fmt.Errorf("failed to push manifest to remote: %w", err)
}

medium

Similar to the config push, the retry logic for pushing the manifest is missing an OnRetry handler. For consistency and better user experience, a handler should be added to update the progress bar during retries. I've also refactored the descriptor creation to make it accessible in the OnRetry callback.

// copy the manifest.
	manifestDesc := ocispec.Descriptor{
		MediaType: manifest.MediaType,
		Size:      int64(len(manifestRaw)),
		Digest:    godigest.FromBytes(manifestRaw),
		Data:      manifestRaw,
	}
	if err := retrypolicy.Do(ctx, func(rctx context.Context) error {
		return pushIfNotExist(rctx, pb, internalpb.NormalizePrompt("Copying manifest"), src, dst, manifestDesc, repo, tag)
	}, retrypolicy.DoOpts{
		FileSize: manifestDesc.Size,
		FileName: "manifest",
		Config:   &cfg.RetryConfig,
		OnRetry: func(attempt uint, reason string, backoff time.Duration) {
			prompt := fmt.Sprintf("%s (retry %d, %s, waiting %s)",
				internalpb.NormalizePrompt("Copying manifest"), attempt, reason, backoff.Truncate(time.Second))
			pb.Add(prompt, manifestDesc.Digest.String(), manifestDesc.Size, nil)
		},
	}); err != nil {
		return fmt.Errorf("failed to push manifest to remote: %w", err)
	}

Contributor Author


OnRetry was added in a later commit; updated to use pb.Placeholder in 973f508.

@aftersnow force-pushed the worktree-timeout-retry branch from 9f96098 to 1369e47 on March 23, 2026 12:55
@aftersnow
Contributor Author

/gemini review

@aftersnow
Contributor Author

@codex-code-assist review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a significant improvement to the application's resilience by implementing a dynamic, file-size-based retry mechanism for data transfers. The new pkg/retrypolicy is well-designed and replaces the previous fixed retry logic. A key architectural change is the move away from cascading cancellations in errgroup, which will prevent a single failure from halting all concurrent operations. The addition of CLI flags for retry control is also a valuable feature. My review focuses on a critical race condition in the progress bar handling, a minor bug in the retry backoff calculation for logging, and opportunities to reduce code duplication for better maintainability.

Comment thread internal/pb/pb.go
Comment on lines +146 to +158
p.mu.RLock()
existing := p.bars[name]
p.mu.RUnlock()

// If the bar already exists, just reset its message.
if existing != nil {
existing.msg = fmt.Sprintf("%s %s", prompt, name)
existing.Bar.SetCurrent(0)
return
}

// Create a new placeholder bar.
p.Add(prompt, name, size, nil)

critical

There is a data race condition here. The existing.msg field is being modified without a lock, while it can be concurrently read by the progress bar's rendering goroutine. This can lead to unpredictable behavior or crashes.

To fix this, you should use a write lock to protect both the read from the p.bars map and the subsequent write to the msg field. The lock should be released before calling p.Add to avoid deadlocks, as p.Add acquires its own locks.

Suggested change

Before:

    p.mu.RLock()
    existing := p.bars[name]
    p.mu.RUnlock()

    // If the bar already exists, just reset its message.
    if existing != nil {
        existing.msg = fmt.Sprintf("%s %s", prompt, name)
        existing.Bar.SetCurrent(0)
        return
    }

    // Create a new placeholder bar.
    p.Add(prompt, name, size, nil)

After:

    p.mu.Lock()
    if existing, ok := p.bars[name]; ok {
        // If the bar already exists, just reset its message.
        existing.msg = fmt.Sprintf("%s %s", prompt, name)
        existing.Bar.SetCurrent(0)
        p.mu.Unlock()
        return
    }
    p.mu.Unlock()

    // Create a new placeholder bar.
    p.Add(prompt, name, size, nil)

Contributor Author


Will be resolved after PR #474 merges — that PR converts bar.msg to atomic.Value, which removes this race. Rebasing here to add a different lock-based fix would conflict with #474's approach, so I'd rather wait and let the rebase pick it up.

Comment thread pkg/backend/fetch.go Outdated
Comment on lines 123 to 130
var annoFilepath string
if layer.Annotations != nil {
if layer.Annotations[modelspec.AnnotationFilepath] != "" {
annoFilepath = layer.Annotations[modelspec.AnnotationFilepath]
} else {
annoFilepath = layer.Annotations[legacymodelspec.AnnotationFilepath]
}
}

medium

This logic to extract the filepath from annotations is duplicated in pkg/backend/fetch_by_d7y.go and pkg/backend/pull_by_d7y.go. To improve maintainability and reduce code duplication, consider extracting this into a shared helper function within the backend package.

For example, you could define this function:

func getAnnotationFilepath(annotations map[string]string) string {
	if annotations == nil {
		return ""
	}
	if path := annotations[modelspec.AnnotationFilepath]; path != "" {
		return path
	}
	return annotations[legacymodelspec.AnnotationFilepath]
}

And then call it here.

Suggested change

Before:

    var annoFilepath string
    if layer.Annotations != nil {
        if layer.Annotations[modelspec.AnnotationFilepath] != "" {
            annoFilepath = layer.Annotations[modelspec.AnnotationFilepath]
        } else {
            annoFilepath = layer.Annotations[legacymodelspec.AnnotationFilepath]
        }
    }

After:

    annoFilepath := getAnnotationFilepath(layer.Annotations)

Contributor Author


Extracted to backend.getAnnotationFilepath in 514f604; all six call sites now use the helper.

Comment thread pkg/backend/fetch_by_d7y.go Outdated
Comment on lines +168 to +175
var annoFilepath string
if desc.Annotations != nil {
if desc.Annotations[modelspec.AnnotationFilepath] != "" {
annoFilepath = desc.Annotations[modelspec.AnnotationFilepath]
} else {
annoFilepath = desc.Annotations[legacymodelspec.AnnotationFilepath]
}
}

medium

This logic to extract the filepath from annotations is duplicated in pkg/backend/fetch.go and pkg/backend/pull_by_d7y.go. To improve maintainability and reduce code duplication, consider extracting this into a shared helper function within the backend package.

For example, you could define this function:

func getAnnotationFilepath(annotations map[string]string) string {
	if annotations == nil {
		return ""
	}
	if path := annotations[modelspec.AnnotationFilepath]; path != "" {
		return path
	}
	return annotations[legacymodelspec.AnnotationFilepath]
}

And then call it here.

Suggested change

Before:

    var annoFilepath string
    if desc.Annotations != nil {
        if desc.Annotations[modelspec.AnnotationFilepath] != "" {
            annoFilepath = desc.Annotations[modelspec.AnnotationFilepath]
        } else {
            annoFilepath = desc.Annotations[legacymodelspec.AnnotationFilepath]
        }
    }

After:

    annoFilepath := getAnnotationFilepath(desc.Annotations)

Contributor Author


Extracted to backend.getAnnotationFilepath in 514f604.

Comment thread pkg/backend/pull_by_d7y.go Outdated
Comment on lines +192 to +199
var annoFilepath string
if desc.Annotations != nil {
if desc.Annotations[modelspec.AnnotationFilepath] != "" {
annoFilepath = desc.Annotations[modelspec.AnnotationFilepath]
} else {
annoFilepath = desc.Annotations[legacymodelspec.AnnotationFilepath]
}
}

medium

This logic to extract the filepath from annotations is duplicated in pkg/backend/fetch.go and pkg/backend/fetch_by_d7y.go. To improve maintainability and reduce code duplication, consider extracting this into a shared helper function within the backend package.

For example, you could define this function:

func getAnnotationFilepath(annotations map[string]string) string {
	if annotations == nil {
		return ""
	}
	if path := annotations[modelspec.AnnotationFilepath]; path != "" {
		return path
	}
	return annotations[legacymodelspec.AnnotationFilepath]
}

And then call it here.

Suggested change

Before:

    var annoFilepath string
    if desc.Annotations != nil {
        if desc.Annotations[modelspec.AnnotationFilepath] != "" {
            annoFilepath = desc.Annotations[modelspec.AnnotationFilepath]
        } else {
            annoFilepath = desc.Annotations[legacymodelspec.AnnotationFilepath]
        }
    }

After:

    annoFilepath := getAnnotationFilepath(desc.Annotations)

Contributor Author


Extracted to backend.getAnnotationFilepath in 514f604.

Comment thread pkg/retrypolicy/retrypolicy.go Outdated
return retryable
}),
retry.OnRetry(func(n uint, err error) {
backoff := computeBackoff(n+1, initialDelay, maxBackoff)

medium

The computeBackoff function is being called with n+1, but it seems to expect the 1-based retry attempt number, which is n. The retry-go library's OnRetry callback provides n as the 1-based retry attempt number. Passing n+1 causes the logged backoff duration to be one step ahead of the actual backoff being used by the retry library.

To ensure the logged backoff duration is accurate, you should call computeBackoff with n.

Suggested change

Before:

    backoff := computeBackoff(n+1, initialDelay, maxBackoff)

After:

    backoff := computeBackoff(n, initialDelay, maxBackoff)

Contributor Author


Fixed in f155ccb — computeBackoff now takes n directly.

Comment on lines +164 to +166
if attempt == 0 {
return initial
}

medium

This check for attempt == 0 appears to be dead code. The retry-go library's OnRetry callback provides n as a 1-based retry attempt number, so n will always be 1 or greater. Since computeBackoff is called with n (or n+1), the attempt parameter will never be zero. This block can be safely removed.

Contributor Author


Dropped in f155ccb; the test was updated to skip attempt=0 for the same reason.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eb1fc1ebc2


Comment thread pkg/retrypolicy/retrypolicy.go Outdated
Comment on lines +222 to +223
// Unknown errors default to retryable.
return true

P1: Stop retrying unclassified errors as transient

This fallback makes every unexpected error retryable, and the new call sites now feed retrypolicy.Do with arbitrary local failures from paths like builder.BuildLayer and extractLayer. A permanent problem such as permission denied or no space left on device will now sit in the retry loop until the size-based deadline expires (10 minutes for small files, 60 minutes for blobs ≥10 GiB), so broken builds/pulls can appear hung for a long time instead of failing promptly.


Contributor Author


Good point — the permanent-vs-transient split needs a deliberate allowlist (network timeouts, 5xx, connection resets, rate-limit responses, ...) rather than ad-hoc narrowing. Leaving for follow-up; I'll send a dedicated PR once we agree on the list so this one can land without scope creep.

Comment thread pkg/backend/pull.go
Comment on lines +121 to 124
retryErr := retrypolicy.Do(ctx, func(retryCtx context.Context) error {
logrus.Debugf("pull: processing layer %s", layer.Digest)
// call the before hook.
cfg.Hooks.BeforePullLayer(layer, manifest)

P2: Thread retryCtx through layer pull operations

The per-attempt retryCtx created by retrypolicy.Do is not actually used for the layer transfer here: fn closes over the outer command context, so both pullIfNotExist and pullAndExtractFromRemote still run with ctx. In the normal pull path, any registry read/write that only aborts on context cancellation can therefore ignore --retry-max-time (or the dynamic size-based limit) and block the whole layer instead of timing out and retrying.


Contributor Author


I think this analysis may have been tripped up by variable shadowing: fn declares its own ctx parameter, and at the call site we pass retryCtx, so pullIfNotExist/pullAndExtractFromRemote run with retryCtx, not the outer command context. Let me know if I'm missing a path where that's not the case.

@aftersnow
Contributor Author

/gemini review

@aftersnow
Contributor Author

@codex-code-assist review


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a robust and dynamic retry mechanism based on file size, significantly improving the reliability of data transfer operations. The refactoring to use a dedicated pkg/retrypolicy package centralizes retry logic, and the move to independent per-goroutine retries effectively addresses the cascading cancellation issue. The addition of CLI flags for retry configuration provides users with greater control. The comprehensive unit tests for the new retry policy are commendable, ensuring the correctness of the dynamic parameter calculations and error handling.

Overall, this is a well-designed and implemented feature that directly tackles a critical pain point for large file transfers.

Comment thread pkg/backend/push.go
Comment on lines +108 to +112
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
prompt := fmt.Sprintf("%s (retry %d, %s, waiting %s)",
internalpb.NormalizePrompt("Copying blob"), attempt, reason, backoff.Truncate(time.Second))
pb.Add(prompt, layer.Digest.String(), layer.Size, nil)
},

medium

For updating the progress bar during a retry, pb.Placeholder is more semantically appropriate than pb.Add. The Placeholder method is designed to update an existing bar's message and reset its current progress to zero, which is ideal for indicating a backoff period before the next attempt. This approach is already used consistently in pull.go and fetch.go.

Please apply this change to the OnRetry callbacks for layers, config, and manifest pushes for consistency and clearer UX.

For example, the OnRetry for layers could be updated as follows:

Suggested change

Before:

    OnRetry: func(attempt uint, reason string, backoff time.Duration) {
        prompt := fmt.Sprintf("%s (retry %d, %s, waiting %s)",
            internalpb.NormalizePrompt("Copying blob"), attempt, reason, backoff.Truncate(time.Second))
        pb.Add(prompt, layer.Digest.String(), layer.Size, nil)
    },

After:

    OnRetry: func(attempt uint, reason string, backoff time.Duration) {
        prompt := fmt.Sprintf("%s (retry %d, %s, waiting %s)",
            internalpb.NormalizePrompt("Copying blob"), attempt, reason, backoff.Truncate(time.Second))
        pb.Placeholder(layer.Digest.String(), prompt, layer.Size)
    },

Contributor Author


Applied to layer, config, and manifest retries in 973f508.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3fbdfe46a


Comment thread pkg/backend/push.go Outdated
Comment on lines +122 to +124
g.Wait()
if len(errs) > 0 {
return fmt.Errorf("failed to push blob to remote: %w", errors.Join(errs...))

P1: Propagate layer cancellation before pushing config

When ctx is cancelled after some layer uploads finish but before the remaining goroutines start (common when cfg.Concurrency < len(manifest.Layers)), those queued workers return ctx.Err() from the early select, but g.Wait() is ignored here. If the already-running uploads succeeded, errs stays empty and Push continues on to the config/manifest uploads, so the command can report success after skipping layers and may publish an incomplete artifact.


Contributor Author


g.Wait() now captures its return into errs (973f508), so a cancelled batch can no longer slip through to the config/manifest push with an incomplete layer set.

Comment thread pkg/backend/pull.go Outdated
Comment on lines +152 to +154
_ = g.Wait()
if len(errs) > 0 {
return fmt.Errorf("failed to pull blob to local: %w", errors.Join(errs...))

P1: Propagate layer cancellation before finalizing pull

If the command context is cancelled between layer batches, the queued workers exit via the top-of-goroutine ctx.Done() check, but this _ = g.Wait() drops that cancellation and errs remains empty. In extract-from-remote mode that means pull can return success after extracting only a subset of files; in normal mode it can also proceed to fetch config/manifest after skipping some layers, leaving an incomplete local artifact.


Contributor Author


Same defensive capture applied to pull in 973f508.

Comment thread pkg/backend/processor/base.go Outdated
Comment on lines +136 to +138
```diff
 select {
 case <-ctx.Done():
-	return nil
+	return ctx.Err()
```

P1: Return context cancellation from skipped build workers

Returning nil here means any worker that never starts because the build context was already cancelled is treated as a success. Combined with the ignored eg.Wait() result below, cancelling a build after some files finish but before the rest start lets Process return no error with only a partial descriptors set, so Build can emit an apparently successful artifact that is missing files from the worktree.

Contributor Author


Fixed in 973f508: the select now returns ctx.Err() and eg.Wait()'s return is captured into errs, so a cancelled build can no longer return a partial descriptor set as a success.

@aftersnow
Contributor Author

@codex-code-assist review

@aftersnow aftersnow force-pushed the worktree-timeout-retry branch from 6ec310c to 1f6c0f5 on March 24, 2026 06:28
…ameters

Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
…ld/fetch

Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
…ading cancellation

Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
…artifacts

Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
… tests

Signed-off-by: Zhao Chen <zhaochen.zju@gmail.com>
@aftersnow aftersnow force-pushed the worktree-timeout-retry branch from 1f6c0f5 to fcdbcb0 on March 24, 2026 06:34
retry-go's OnRetry callback supplies a 1-based retry attempt number,
so computeBackoff(n+1, ...) logged a delay that was one doubling
ahead of the actual backoff used by the retry loop. Pass n directly.

Also drop the unreachable attempt == 0 branch in computeBackoff
(retry-go never supplies 0) and skip that case in the test.

Addresses review feedback on PR #468 (gemini).

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
The 'prefer modelspec.AnnotationFilepath, fall back to the legacy
dragonflyoss key' pattern was duplicated across fetch.go,
fetch_by_d7y.go, and pull_by_d7y.go (six call sites total). Centralize
it in backend.getAnnotationFilepath so each caller is a one-liner and
future changes to the annotation resolution live in one place.

Addresses review feedback on PR #468 (gemini).

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
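The centralized helper can be sketched as follows. The concrete annotation key strings here are placeholders; the real constants come from the model-spec package and the legacy dragonflyoss namespace:

```go
package main

import "fmt"

// Placeholder key names for illustration only; the real code uses
// modelspec.AnnotationFilepath and the legacy dragonflyoss constant.
const (
	annotationFilepath       = "org.cnai.model.filepath"
	legacyAnnotationFilepath = "org.dragonflyoss.model.filepath"
)

// getAnnotationFilepath prefers the model-spec key and falls back to
// the legacy key, so each of the six former call sites becomes a
// one-liner and the resolution order lives in one place.
func getAnnotationFilepath(annotations map[string]string) string {
	if v, ok := annotations[annotationFilepath]; ok && v != "" {
		return v
	}
	return annotations[legacyAnnotationFilepath]
}

func main() {
	// Legacy-only manifests still resolve.
	fmt.Println(getAnnotationFilepath(map[string]string{
		legacyAnnotationFilepath: "weights/model.bin",
	})) // → weights/model.bin
}
```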
Three call sites (push, pull, processor) used 'g.Wait()' without
capturing its return, so a cancelled worker's ctx.Err() could be
discarded. In the edge case where all running uploads/layers succeed
but queued-but-unstarted workers exit via the ctx.Done() select,
errs stayed empty and the operation continued to config/manifest push
(or finalized a partial descriptor set), potentially publishing an
incomplete artifact. Capture the Wait() result into errs and rely on
the existing ctx.Err()/len(errs) checks to return cleanly. Also change
processor/base.go's select from 'return nil' to 'return ctx.Err()' so
the same propagation path applies there.

Additionally, align push.go's OnRetry with pull.go/fetch.go by using
pb.Placeholder (designed to reset an existing bar's message and
progress) instead of pb.Add for layer, config, and manifest retries.

Addresses review feedback on PR #468 (codex P1, gemini).

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
@aftersnow aftersnow marked this pull request as draft on April 23, 2026 05:58
The previous design used a single MaxRetryTime budget to bound both
the in-flight transfer and the inter-attempt sleeps. With a wall-clock
that scales with file size, a slow first attempt can consume the whole
budget, leaving no room for retries — exactly when retries matter most.

Split into two independent knobs:

  * PerAttemptTimeout: derived from file size assuming a 10 MiB/s
    minimum throughput with a 2x safety factor, clamped to [5min, 8h].
    Each attempt gets its own context.WithTimeout, cancelled per
    attempt.

  * MaxAttempts + MaxBackoff: bound retry-only behavior. Defaults are
    constants (6 attempts, 2min backoff cap) and do not scale with
    file size — transient-failure recovery time is payload-independent.

The retry loop classifies a per-attempt DeadlineExceeded under a live
parent context as retryable, so a single transfer timeout no longer
short-circuits the loop. User cancellation (parent ctx) still aborts
immediately.

CLI:

  --retry-max-time       removed (semantics conflated transfer and retry)
  --retry-attempts       new (int, total attempts including initial)
  --per-attempt-timeout  new (Duration; 0 = derive from size, <0 = off)
  --no-retry             unchanged

Tests cover the size→timeout mapping with clamps, the per-attempt
deadline retry loop, parent-cancel propagation, and a size-invariance
test that pins the design's core property: total retry wall-clock does
not depend on file size.

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
@aftersnow
Contributor Author

@codex-code-assist review

No behavioral change. golangci-lint v2.5 (gci, golines with line
length 122) flagged whitespace alignment in struct literals and
long single-line table-driven test entries; auto-fixed via
make lint-fix.

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fc3ac13156


Comment thread pkg/retrypolicy/retrypolicy.go Outdated
Comment on lines +148 to +150
```go
if cfg.NoRetry {
	return fn(ctx)
}
```

P1: Honor per-attempt timeout when retries are disabled

Do returns early when NoRetry is true, so it calls fn(ctx) without applying PerAttemptTimeout. This means --no-retry silently disables the per-attempt deadline instead of only disabling extra attempts, and a single transfer can hang indefinitely unless the caller set its own context deadline. Users who combine --no-retry with --per-attempt-timeout will not get the timeout behavior they requested.

Contributor Author


Good catch — fixed. The per-attempt deadline now applies on the NoRetry path too (single-attempt) since it's a transfer-bound concern, not a retry concern. Added TestDo_NoRetryHonorsPerAttemptTimeout to lock the invariant.

The previous Do() short-circuited NoRetry by calling fn(ctx) directly,
bypassing the per-attempt deadline derivation. As Codex flagged, this
made --no-retry silently disable the transfer timeout: a hung
connection would never terminate, leaving users with a stalled CLI
and no failure signal.

Per-attempt timeout is a transfer-bound concern, not a retry concern,
so it must apply on the single-attempt path too. Refactor the deadline
setup into a small runAttempt closure used by both the NoRetry branch
and retry.Do.

Add TestDo_NoRetryHonorsPerAttemptTimeout to pin this invariant.

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
@aftersnow
Contributor Author

/gemini review


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a specialized retrypolicy package to manage blob transfer retries, decoupling per-attempt timeouts from the retry backoff strategy. It adds CLI flags for retry configuration across the build, fetch, pull, and push commands and refactors the backend to use this new policy, improving error handling and progress tracking. A Placeholder method was added to the progress bar to handle resets during retries. Feedback suggests centralizing the progress bar reset logic within the Placeholder method and adopting the go-humanize library for byte formatting to eliminate redundant code.

Comment thread internal/pb/pb.go
Comment on lines +151 to +154
```go
if existing != nil {
	existing.msg = fmt.Sprintf("%s %s", prompt, name)
	existing.Bar.SetCurrent(0)
	return
```

medium

The Placeholder method should encapsulate the full reset logic used during retries, including setting the refill value and resetting the EWMA speed calculation. This ensures consistency across different transfer paths and avoids manual bar manipulation in the backend packages.

Suggested change

```go
if existing != nil {
	existing.msg = fmt.Sprintf("%s %s", prompt, name)
	existing.Bar.SetCurrent(0)
	return
```

```go
if existing != nil {
	existing.msg = fmt.Sprintf("%s %s", prompt, name)
	existing.SetRefill(existing.Current())
	existing.SetCurrent(0)
	existing.EwmaSetCurrent(0, time.Second)
	return
}
```

Comment thread pkg/backend/fetch.go
Comment on lines +127 to +133
```go
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
	if bar := pb.Get(layer.Digest.String()); bar != nil {
		bar.SetRefill(bar.Current())
		bar.SetCurrent(0)
		bar.EwmaSetCurrent(0, time.Second)
	}
},
```

medium

Use the pb.Placeholder method to reset the progress bar during retries. This avoids manual manipulation of the progress bar and ensures that the reset logic (including refill and EWMA reset) is applied consistently.

```go
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
	pb.Placeholder(layer.Digest.String(), internalpb.NormalizePrompt("Fetching blob"), layer.Size)
},
```

Comment on lines +183 to +189
```go
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
	if bar := pb.Get(desc.Digest.String()); bar != nil {
		bar.SetRefill(bar.Current())
		bar.SetCurrent(0)
		bar.EwmaSetCurrent(0, time.Second)
	}
},
```

medium

Use the pb.Placeholder method to reset the progress bar during retries for consistency with other transfer paths.

```go
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
	pb.Placeholder(desc.Digest.String(), internalpb.NormalizePrompt("Fetching blob"), desc.Size)
},
```

Comment on lines +210 to +216
```go
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
	if bar := pb.Get(desc.Digest.String()); bar != nil {
		bar.SetRefill(bar.Current())
		bar.SetCurrent(0)
		bar.EwmaSetCurrent(0, time.Second)
	}
},
```

medium

Use the pb.Placeholder method to reset the progress bar during retries for consistency with other transfer paths.

```go
OnRetry: func(attempt uint, reason string, backoff time.Duration) {
	pb.Placeholder(desc.Digest.String(), internalpb.NormalizePrompt("Pulling blob"), desc.Size)
},
```

Comment on lines +35 to +46
```go
import (
	"context"
	"errors"
	"fmt"
	"math"
	"regexp"
	"strings"
	"time"

	retry "github.com/avast/retry-go/v4"
	log "github.com/sirupsen/logrus"
)
```

medium

Add github.com/dustin/go-humanize to the imports to use it for byte formatting, avoiding duplication of the humanizeBytes function.

Suggested change

```go
import (
	"context"
	"errors"
	"fmt"
	"math"
	"regexp"
	"strings"
	"time"

	retry "github.com/avast/retry-go/v4"
	log "github.com/sirupsen/logrus"
)
```

```go
import (
	"context"
	"errors"
	"fmt"
	"math"
	"regexp"
	"strings"
	"time"

	retry "github.com/avast/retry-go/v4"
	humanize "github.com/dustin/go-humanize"
	log "github.com/sirupsen/logrus"
)
```
References
  1. To avoid code duplication, use functions from existing dependencies for common tasks (e.g., byte formatting) instead of reimplementing them across multiple packages.

Comment on lines +378 to +395
```go
func humanizeBytes(b int64) string {
	const (
		kb = 1024
		mb = 1024 * kb
		gb = 1024 * mb
	)

	switch {
	case b >= gb:
		return fmt.Sprintf("%.1f GB", float64(b)/float64(gb))
	case b >= mb:
		return fmt.Sprintf("%.1f MB", float64(b)/float64(mb))
	case b >= kb:
		return fmt.Sprintf("%.1f KB", float64(b)/float64(kb))
	default:
		return fmt.Sprintf("%d B", b)
	}
}
```

medium

To avoid code duplication and maintain consistency, use the existing github.com/dustin/go-humanize dependency for byte formatting instead of reimplementing humanizeBytes. This follows the general rule for common tasks.

```go
func humanizeBytes(b int64) string {
	return humanize.Bytes(uint64(b))
}
```
References
  1. To avoid code duplication, use functions from existing dependencies for common tasks (e.g., byte formatting) instead of reimplementing them across multiple packages.

aftersnow added 2 commits May 6, 2026 20:58
The --no-retry flag was redundant once the retry budget was decoupled
from the per-attempt timeout: it expressed nothing that
--retry-attempts=1 cannot. Keeping both invited the class of bug Codex
found earlier (NoRetry short-circuited Do() and bypassed the
per-attempt deadline) — having two ways to say "don't retry" makes it
easy to wire only one of them through correctly.

After this commit there are exactly two retry-related CLI flags:

  --retry-attempts        int (default 0 → 6; set to 1 for fail-fast)
  --per-attempt-timeout   duration (0 → derive from size; <0 → disabled)

Each controls one orthogonal concern. "No retry" is a value of one of
them, not a separate flag.

BREAKING CHANGE: --no-retry is removed. Replace with --retry-attempts=1.
The flag was introduced in this same PR and never shipped, so external
breakage scope is just the PR's own iteration history.

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>
Remove --retry-attempts and --per-attempt-timeout. Both controlled
operational behavior (retry tolerance, network speed assumption) that
practically never varies per invocation, so they didn't earn a slot
on the CLI. Keeping them invited misuse and grew help-text noise for
no measurable benefit.

After this commit, retry behavior is fully determined by package
defaults in pkg/retrypolicy:

  - 6 attempts (DefaultMaxAttempts)
  - 5s initial delay, exponential backoff capped at 2 min
  - Per-attempt timeout derived from file size (10 MiB/s minimum
    throughput, 2x safety, clamped to [5min, 8h])

The retrypolicy.Config struct stays public for programmatic users
embedding modctl in another binary; it just isn't wired to any CLI
flag. If a real user shows up with a network where the defaults break
down, we can add a knob then with their case as the documented
example — until then, YAGNI wins.

Cleanup:
  - cmd/{push,pull,build,fetch}.go: drop flag registrations.
  - pkg/config/{push,pull,build,fetch}.go: drop RetryConfig field
    and the now-unused retrypolicy import.
  - pkg/backend/{push,pull,fetch,build,*_by_d7y}.go: drop
    Config: &cfg.RetryConfig from each retrypolicy.Do call.
  - pkg/backend/processor/options.go: drop WithRetryConfig and the
    retryConfig field; processor.Process no longer takes a retry
    option.
  - pkg/backend/processor/base.go: stop threading processOpts.retryConfig
    into retrypolicy.DoOpts.

BREAKING CHANGE: --retry-attempts and --per-attempt-timeout are
removed. Both were introduced earlier in this same PR and never
shipped.

Signed-off-by: Zhao Chen <winters.zc@antgroup.com>