Skip to content

fix(policy): register new runner before stopping old one (PILOT-310)#7

Open
matthew-pilot wants to merge 2 commits into
mainfrom
openclaw/pilot-310-20260530-213154
Open

fix(policy): register new runner before stopping old one (PILOT-310)#7
matthew-pilot wants to merge 2 commits into
mainfrom
openclaw/pilot-310-20260530-213154

Conversation

@matthew-pilot
Copy link
Copy Markdown
Collaborator

What failed

startInternal in service.go:235-242 stopped the old PolicyRunner while holding s.mu and before the new runner was started — creating a brief window where Get(netID) could return nil and gate decisions would see no active policy.

Why this fix

Reordered the swap: register and start the new runner first (under the mutex), release the mutex, then stop the old runner outside the lock. This eliminates the gap entirely — gate decisions always see a live runner.

Verification

  • go build ./... — clean
  • go vet ./... — clean
  • go test ./... — all pass (5.7s)
  • Added TestStartInternal_AtomicSwap that hammers Get in a goroutine during reload and confirms no nil return

Scope

  • service.go: +4/−3 (the fix)
  • zz_service_atomic_swap_test.go: +100 (new test)

Closes PILOT-310

Verify that during a policy reload, Get(netID) never returns nil — the
new runner must be registered before the old one is stopped.

The test hammers managerView.Get in a goroutine while startInternal
replaces a live runner with a new policy. Any nil return during the
swap is a gap where gate decisions see no policy.
In startInternal, the old runner was stopped while holding s.mu and
before the new runner was started — creating a brief window where
Get(netID) could return nil and gate decisions see no policy.

Fix: register and start the new runner first, release the mutex, then
stop the old runner outside the lock. This eliminates the window
entirely.

Closes PILOT-310
@codecov
Copy link
Copy Markdown

codecov Bot commented May 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@matthew-pilot
Copy link
Copy Markdown
Collaborator Author

🤖 matthew-pr-worker — PR Status

State: OPEN · Mergeable: MERGEABLE (CLEAN) · Draft: No

CI Checks

Check Result
test ✅ SUCCESS
codecov/patch ✅ SUCCESS

Overall: 2/2 ✅

Scope

  • service.go — +4/−3 (runner swap reorder)
  • zz_service_atomic_swap_test.go — +100/−0 (new test)
  • Total: 2 files, +104/−3

Labels

None

Canary

Status: not yet triggered — pilot-protocol/policy canary: [smoke]


automated by matthew-pr-worker | 2026-05-30T21:41 UTC

@matthew-pilot
Copy link
Copy Markdown
Collaborator Author

🤖 matthew-pr-worker — PR Explain

PILOT-310: fix(policy): register new runner before stopping old one

What changed

The runner swap in service.go:startInternal() was reordered:

Before (❌ racy):

  1. Lock s.mu
  2. Stop old runner (inside lock)
  3. Start new runner (inside lock)
  4. Unlock

→ Gap: between stop and start, Get(netID) could return nil — gate decisions see no active policy.

After (✅ atomic):

  1. Lock s.mu
  2. Register new runner + start (inside lock)
  3. Unlock
  4. Stop old runner (outside lock)

→ No gap. Get() always finds a live runner.

Files

File Δ Purpose
service.go +4/−3 Reorder register-then-stop
zz_service_atomic_swap_test.go +100/−0 Concurrent Get() hammer test during reload — confirms zero nil returns

Test highlights

TestStartInternal_AtomicSwap fires 10 goroutines calling Get in a tight loop while startInternal reloads the runner — asserts every Get return is non-nil.


automated by matthew-pr-worker | 2026-05-30T21:41 UTC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant