Skip to content

fix: prevent git task workers from diverging during sync fan-out (#9349)#9360

Open
polmichel wants to merge 15 commits into
stablefrom
pmi-IFC-2642-task-workers-diverge
Open

fix: prevent git task workers from diverging during sync fan-out (#9349)#9360
polmichel wants to merge 15 commits into
stablefrom
pmi-IFC-2642-task-workers-diverge

Conversation

@polmichel
Copy link
Copy Markdown
Contributor

@polmichel polmichel commented May 26, 2026

Why

In a multi-worker task pool, the git sync flow fans work out across workers. When a new upstream commit landed between the orchestrator resolving HEAD and the other workers running their own git pull, workers ended up on different commits for the same repo.

Goal: make every fan-out worker converge on the exact commit the orchestrator pinned, regardless of what lands upstream during fan-out.

This PR also carries a second, independent git fix (see below).

Closes #9349

What changed

Behavioral

  • The sync orchestrator now records the commit SHA it just synced and broadcasts it in RefreshGitFetch. Receiving workers check out that exact SHA instead of fast-forwarding to whatever upstream HEAD currently is.
  • Separate fix: pulling a repository that had to create a missing local branch crashed when recording its commit value (an unbound infrahub_branch). The branch is now bound before the commit is written.

How to review

A commit by commit approach would be relevant since there is:

  • One commit for the initial bug fix 93e86eb and one for the related test 7e2f139
  • One commit for the collateral bug and related test 308110c
  • One commit for the refactor 419ef91

Following commits are unitary fix or similar issue using the new pinned commit, or documentation/lint/format matter.
Extra scrutiny welcome on the pin source per flow.

How to test

# Divergence reproduction (fails before the fix, passes after)
uv run pytest backend/tests/component/message_bus/operations/git/test_refresh_git_fetch.py -x -v

# Missing-branch pull regression
uv run pytest backend/tests/component/git/test_git_repository.py -x -v

Known follow-ups (intentionally out of scope)

  • That read-only flow broadcasts repository_kind=REPOSITORY for a read-only repo. Worth confirming independently.

Checklist

  • Tests added/updated
  • Changelog entries added (9349.fixed.md, +pull-new-branch-unbound-commit.fixed.md)
  • External docs updated (no user-facing/ops-facing surface change)
  • Internal .md docs updated (no knowledge-doc change needed)
  • I have reviewed AI generated content

Summary by cubic

Prevents git task workers from diverging during sync fan-out by pinning the orchestrator’s commit and hard-resetting workers to that exact SHA under a repository lock. Also fixes a branch-creation pull crash and ensures commit values are recorded. Addresses IFC-2642.

  • Bug Fixes
    • Added optional commit to RefreshGitFetch; workers now fetch and hard-reset via reset_to_commit under the repo lock, falling back to pull if the pin is missing or can’t be resolved.
    • Applied pinning across periodic sync, repository add (read/write and read-only), branch create, merge, and read-only pull; ref-only read-only flows resolve a concrete SHA once and broadcast it.
    • Fixed missing-branch pull crash by binding the branch before writing its commit; the commit value now updates when a new local branch is created.
    • Narrowed pin-resolution errors to ValueError and InvalidGitRepositoryError; added a fan-out convergence test and a commit-update test; updated RPC mocks and docs for the new commit field.

Written for commit 23e08ed. Summary will update on new commits. Review in cubic

polmichel and others added 9 commits May 27, 2026 00:21
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
get_commit_value only raises ValueError (branch absent) or
InvalidGitRepositoryError (clone missing/corrupt). Catching the GitError
base class also swallowed GitCommandError, HookExecutionError, and other
unexpected git failures, masking them behind a silent fall-back to pull().
Narrow both pin-resolution sites to the exceptions actually expected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After merging into the destination branch, resolve the resulting commit and
broadcast it in RefreshGitFetch so fan-out workers check out the merge commit
instead of pulling the destination branch to whatever upstream HEAD is at that
moment, keeping the pool converged if upstream advances during fan-out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve the new branch's commit and broadcast it in RefreshGitFetch so fan-out
workers check out that exact SHA rather than pulling the branch to whatever
upstream HEAD is at fetch time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the group/backend Issue related to the backend (API Server, Git Agent) label May 26, 2026
await repo.sync_from_remote()

# Notify other workers they need to clone the repository
notification = messages.RefreshGitFetch(

This comment was marked as outdated.

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 8 files

Confidence score: 3/5

  • There is a concrete regression risk in backend/infrahub/git/tasks.py: pull_read_only broadcasts commit=model.commit rather than the commit actually synced, which can mislead downstream workers.
  • The most severe impact is user-facing sync divergence: when model.commit is None, fan-out workers may fall back to pull() and reintroduce drift instead of converging on the intended revision.
  • Given the issue’s relatively high severity/confidence (7/10, 8/10), this is a meaningful but likely targeted fix rather than a broad systemic failure, so merge risk is moderate.
  • Pay close attention to backend/infrahub/git/tasks.py - ensure the broadcasted commit matches the synced commit to prevent fallback pulls and divergence.

Shadow auto-approve: would not auto-approve because issues were found.

Re-trigger cubic

Comment thread backend/infrahub/git/tasks.py Outdated
@polmichel polmichel marked this pull request as ready for review May 26, 2026 22:59
@polmichel polmichel requested a review from a team as a code owner May 26, 2026 22:59
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 26, 2026

Merging this PR will not alter performance

✅ 12 untouched benchmarks


Comparing pmi-IFC-2642-task-workers-diverge (23e08ed) with stable (edaeba7)1

Open in CodSpeed

Footnotes

  1. No successful run was found on stable (969810f) during the generation of this report, so edaeba7 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

polmichel and others added 3 commits May 27, 2026 01:02
Add the generated row for the new commit field and tighten its description
to a single line so the message-bus events reference matches generation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A ref-only pull (commit unset) resolved a concrete SHA during sync but
broadcast the unset commit, leaving fan-out workers to re-resolve the ref
independently and diverge. Resolve the ref once and use that SHA for both
the sync and the broadcast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve the ref to a concrete SHA once and use it for both the initial sync
and the RefreshGitFetch broadcast, so fan-out workers cloning the new
read-only repository converge on that commit instead of re-resolving the ref.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@polmichel polmichel requested a review from a team as a code owner May 26, 2026 23:10
@github-actions github-actions Bot added the type/documentation Improvements or additions to documentation label May 26, 2026
@@ -145,6 +157,7 @@ async def add_git_repository_read_only(model: GitRepositoryAddReadOnly) -> None:
repository_kind=InfrahubKind.REPOSITORY,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an issue that this is tagged as REPOSITORY and not a read-only repository type?

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="docs/docs/reference/message-bus-events.mdx">

<violation number="1" location="docs/docs/reference/message-bus-events.mdx:82">
P2: The Type column for the `commit` field is set to "N/A", but a commit SHA is a string value. This is inconsistent with the other fields in the same table (all typed as "string").</violation>
</file>

Shadow auto-approve: would not auto-approve because issues were found.

Re-trigger cubic

| **repository_kind** | The type of repository | string | None |
| **infrahub_branch_name** | Infrahub branch on which to sync the remote repository | string | None |
| **infrahub_branch_id** | Id of the Infrahub branch on which to sync the remote repository | string | None |
| **commit** | Commit SHA to check out, pinned by the sync orchestrator instead of pulling the latest upstream HEAD | N/A | None |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: The Type column for the commit field is set to "N/A", but a commit SHA is a string value. This is inconsistent with the other fields in the same table (all typed as "string").

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/docs/reference/message-bus-events.mdx, line 82:

<comment>The Type column for the `commit` field is set to "N/A", but a commit SHA is a string value. This is inconsistent with the other fields in the same table (all typed as "string").</comment>

<file context>
@@ -79,6 +79,7 @@ For more detailed explanations on how these events are used within Infrahub, see
 | **repository_kind** | The type of repository | string | None |
 | **infrahub_branch_name** | Infrahub branch on which to sync the remote repository | string | None |
 | **infrahub_branch_id** | Id of the Infrahub branch on which to sync the remote repository | string | None |
+| **commit** | Commit SHA to check out, pinned by the sync orchestrator instead of pulling the latest upstream HEAD | N/A | None |
 <!-- vale on -->
 <!-- vale off -->
</file context>
Suggested change
| **commit** | Commit SHA to check out, pinned by the sync orchestrator instead of pulling the latest upstream HEAD | N/A | None |
| **commit** | Commit SHA to check out, pinned by the sync orchestrator instead of pulling the latest upstream HEAD | string | None |

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This documentation is auto-generated. This is surely due to the fact that the field as a default value at None, but I am not totally sure.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the context—understood that this table is auto-generated.

@polmichel polmichel marked this pull request as draft May 27, 2026 07:09
polmichel and others added 2 commits May 27, 2026 09:17
The add and read-only-add flows now resolve a commit and broadcast it, so the
mocked repos must return a SHA from get_commit_value (and expose ref for the
read-only flow), and the read-only sync assertion expects the pinned commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 1 file (changes from recent commits).

Shadow auto-approve: would require human review. This PR modifies core git synchronization logic across multiple workers, introducing commit pinning and a new reset_to_commit method to prevent divergence; such changes affect a critical path for repository data integrity and require human review due to potential impact on fan-out behavior...

Re-trigger cubic

@polmichel polmichel marked this pull request as ready for review May 27, 2026 07:43
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 issues found across 1 file (changes from recent commits).

Shadow auto-approve: would require human review. This PR modifies core git synchronization logic to prevent worker divergence, introducing a new reset_to_commit method, altering pull and message broadcast behavior, and changing concurrency handling under locks; while well-tested, the blast radius and risk to data integrity across all...

Re-trigger cubic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

group/backend Issue related to the backend (API Server, Git Agent) type/documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: task workers diverge when upstream commit lands between sync start and fan-out git pulls

1 participant