Conversation

@netsirius (Collaborator) commented Dec 6, 2025

Summary

This PR fixes random test failures in the blocked peers test suite caused by race conditions between operations and missing subscription handling. The changes ensure proper operation atomicity and reliable state propagation across the network.

Problems Solved

1. Race Condition: PUT vs Subscribe

Problem: Subscription requests arrive at remote peers before the contract has propagated via PUT, causing NoCachingPeers errors and random test failures.

Solution: Channel-based notification system in OpManager (sketched after this list):

  • wait_for_contract(key) registers a waiter
  • notify_contract_stored(key) notifies when PUT completes
  • Subscribe waits up to 2s using tokio::select! for notification or timeout
  • Graceful fallback: complete locally if contract exists but no remote peers
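
A minimal sketch of the wait/notify pair, assuming the registry is a mutex-guarded map on OpManager keyed by contract (the method names come from this PR; the struct layout, lock type, and stand-in ContractKey type are assumptions):

use std::collections::HashMap;
use std::sync::Mutex;
use tokio::sync::oneshot;

type ContractKey = String; // stand-in for the real contract key type

#[derive(Default)]
struct ContractWaiters {
    waiters: Mutex<HashMap<ContractKey, Vec<oneshot::Sender<()>>>>,
}

impl ContractWaiters {
    // Register a waiter; the returned receiver fires once the contract is stored.
    fn wait_for_contract(&self, key: ContractKey) -> oneshot::Receiver<()> {
        let (tx, rx) = oneshot::channel();
        self.waiters.lock().unwrap().entry(key).or_default().push(tx);
        rx
    }

    // Called after a successful PUT: wake every waiter registered for the key.
    fn notify_contract_stored(&self, key: &ContractKey) {
        if let Some(senders) = self.waiters.lock().unwrap().remove(key) {
            for tx in senders {
                let _ = tx.send(()); // receiver may already be dropped (timed out); harmless
            }
        }
    }
}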

2. Missing Subscription Listeners in GET Paths

Problem: GET operations with subscribe=true weren't registering listeners in the network and legacy code paths, so clients missed updates.

Solution: Added a register_subscription_listener() helper (a hypothetical sketch follows this list) and called it in:

  • PUT with auto-subscribe
  • Local GET (refactored)
  • Network GET via RequestRouter
  • Legacy GET path
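
The helper's body is not quoted in this thread; as a hypothetical sketch only (the parameter list, listener map, and stand-in types are assumptions, not the actual client_events/mod.rs code), it could look like:

use std::collections::HashMap;
use tokio::sync::mpsc::UnboundedSender;

type ContractKey = String;                 // stand-in for the real key type
type HostResult = Result<Vec<u8>, String>; // stand-in for the real update payload type

// Hypothetical: route future update notifications for `key` to this client's channel.
// The operation_type parameter mirrors the debugging context mentioned in the reviews below.
fn register_subscription_listener(
    listeners: &mut HashMap<ContractKey, Vec<UnboundedSender<HostResult>>>,
    key: ContractKey,
    client_tx: UnboundedSender<HostResult>,
    operation_type: &str,
) {
    tracing::debug!(operation_type, key = %key, "registering subscription listener");
    listeners.entry(key).or_default().push(client_tx);
}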

3. StatePushed Treated as Error

Problem: OpError::StatePushed signals async re-queuing but was logged as error.

Solution: Handle gracefully:

  • client_events/mod.rs: Ok(()) | Err(OpError::StatePushed) => {}
  • node/mod.rs: Early return without error logging

4. Test Infrastructure Issues

Problem: Global RNG caused cross-test interference; concurrent contract compilation caused race conditions.

Solution:

  • Per-test seeded RNG from test name hash
  • COMPILE_LOCK mutex for contract compilation
  • Generic Rng trait bounds

5. WebSocket Backpressure

Problem: Long operations caused WebSocket message buildup and test timeouts.

Solution: check_interval parameter for periodic message draining during waits.
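
Roughly, the pattern looks like this (check_interval is the parameter added by this PR; the wait-loop shape and the draining helper are assumptions):

use std::future::Future;
use std::time::Duration;

// Sketch: while awaiting a long-running operation, wake up every check_interval
// to drain buffered WebSocket messages so they do not pile up and stall the test.
async fn wait_with_draining<F: Future>(operation: F, check_interval: Duration) -> F::Output {
    tokio::pin!(operation);
    loop {
        tokio::select! {
            result = &mut operation => return result,
            _ = tokio::time::sleep(check_interval) => {
                drain_pending_ws_messages().await; // hypothetical helper
            }
        }
    }
}

async fn drain_pending_ws_messages() {
    // Read and dispatch (or discard) any frames queued on the socket.
}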

Files Changed

  • crates/core/src/node/op_state_manager.rs: Contract waiter notification system
  • crates/core/src/operations/subscribe.rs: Channel-based waiting, graceful fallbacks
  • crates/core/src/client_events/mod.rs: Subscription registration refactor, StatePushed handling
  • crates/core/src/operations/put.rs: Notify on storage, use start_subscription_request
  • crates/core/src/node/mod.rs: StatePushed handling in network result
  • apps/freenet-ping/app/tests/common/mod.rs: RNG isolation, COMPILE_LOCK
  • apps/freenet-ping/app/tests/run_app_blocked_peers.rs: Remove #[ignore], adjust timeouts, add check_interval
  • apps/freenet-ping/**/Cargo.toml: Workspace dependency alignment

Testing

# Previously flaky, now passes 10+ consecutive runs
cargo test --package freenet-ping-app --test run_app_blocked_peers -- --nocapture

Closes #2106

@netsirius netsirius changed the title Fix/2106 resolve race conditions and improve operation atomicity fix: Resolve race conditions and improve operation atomicity Dec 6, 2025
@netsirius (Collaborator, Author) commented:

@claude review it

claude bot commented Dec 6, 2025

Claude finished @netsirius's task


Code Review Complete ✅

I've reviewed the PR addressing race conditions and improving operation atomicity. Overall, this is a well-designed solution to a challenging concurrency problem. The changes demonstrate solid understanding of async Rust and distributed systems patterns.


🎯 Summary

The PR successfully addresses random test failures by fixing race conditions between PUT and Subscribe operations through a channel-based notification system. The core approach is sound and the implementation is thorough.

Key Strengths:

  • Clean architectural separation with contract_waiters notification system
  • Proper fallback mechanisms in subscribe.rs
  • Comprehensive test infrastructure improvements (RNG isolation, compile lock)
  • Good documentation explaining the "why" behind changes

Issues Found:

  • 2 potential race conditions that need addressing
  • 1 resource leak concern
  • Minor code quality improvements needed

🔴 Critical Issues

1. Race condition in notify_contract_stored waiters removal (crates/core/src/node/op_state_manager.rs:671-687)

Problem: The current implementation removes ALL waiters when notifying, but there's a race between wait_for_contract creating a receiver and notify_contract_stored removing the waiters HashMap entry.

// Current code - RACE CONDITION:
pub fn notify_contract_stored(&self, key: &ContractKey) {
    let mut waiters = self.contract_waiters.lock();
    if let Some(senders) = waiters.remove(key) {  // ← Removes entire entry
        // ...
    }
}

Scenario:

  1. Thread A: PUT completes, calls notify_contract_stored(key), removes HashMap entry
  2. Thread B: Subscribe arrives, calls wait_for_contract(key), creates NEW HashMap entry
  3. Thread B's receiver will NEVER be signaled (the contract is already stored)

Fix: Check if contract exists before waiting, or use a different notification pattern:

pub fn wait_for_contract(&self, key: ContractKey) -> oneshot::Receiver<()> {
    // OPTION 1: Quick check if already stored (fast path)
    if self.ring.is_seeding_contract(&key) {
        let (tx, rx) = oneshot::channel();
        let _ = tx.send(()); // Immediately signal
        return rx;
    }
    
    // OPTION 2: Or register BEFORE checking, then notify might fire twice (harmless)
    let (tx, rx) = oneshot::channel();
    let mut waiters = self.contract_waiters.lock();
    waiters.entry(key).or_default().push(tx);
    rx
}

Impact: High - could cause Subscribe operations to hang waiting for notifications that will never arrive.


2. Resource leak in contract_waiters on Subscribe timeout (crates/core/src/operations/subscribe.rs:569-581)

Problem: When wait_for_contract times out in subscribe.rs, the oneshot sender remains in the HashMap, leaking memory over time.

// Current code at subscribe.rs:569-581
let contract_notifier = op_manager.wait_for_contract(*key);

let contract_arrived = tokio::select! {
    _ = contract_notifier => true,  // ← Sender removed from HashMap on success
    _ = sleep(Duration::from_millis(CONTRACT_WAIT_TIMEOUT_MS)) => {
        // ← Timeout path: sender still in HashMap!
        super::has_contract(op_manager, *key).await.unwrap_or(false)
    }
};

Fix: Add cleanup method to OpManager:

// In op_state_manager.rs:
pub fn cancel_contract_wait(&self, key: &ContractKey) {
    let mut waiters = self.contract_waiters.lock();
    waiters.remove(key);
}

// In subscribe.rs:
let contract_arrived = tokio::select! {
    _ = contract_notifier => {
        // Notification received - contract was stored
        true
    }
    _ = sleep(Duration::from_millis(CONTRACT_WAIT_TIMEOUT_MS)) => {
        // Cleanup on timeout to prevent leak
        op_manager.cancel_contract_wait(key);
        super::has_contract(op_manager, *key).await.unwrap_or(false)
    }
};

Impact: Medium - memory leak that accumulates with failed Subscribe operations.


🟡 Medium Priority Issues

3. Incomplete subscription listener registration (crates/core/src/client_events/mod.rs:206-245)

The register_subscription_listener helper is well-designed, but there's an inconsistency:

Good: PUT with subscribe=true calls register_subscription_listener (line 654-667)
Good: GET (local) with subscribe=true calls register_subscription_listener (line 909-922)
Good: GET (network) with subscribe=true calls register_subscription_listener (line 1010-1024)

Open question: the legacy GET path (lines 1066-1080) does call register_subscription_listener, but it is unclear whether the error paths also register listeners.

Recommendation: Add defensive checks to ensure all code paths register listeners when subscribe=true. Consider adding a debug assertion:

if subscribe {
    debug_assert!(
        subscription_listener.is_some(),
        "subscribe=true requires subscription_listener channel"
    );
}

4. StatePushed error handling asymmetry (crates/core/src/node/mod.rs)

The PR description mentions handling OpError::StatePushed gracefully, but only a portion of node/mod.rs (lines 0-100) is visible in this review's diff context. Based on the PR description:

Current approach (from description):

  • client_events/mod.rs: Ok(()) | Err(OpError::StatePushed) => {}
  • node/mod.rs: Early return without error logging

Concern: This pattern is scattered across multiple files. Consider:

  1. Centralize the pattern: Create a helper like is_requeued_ok(result: Result<(), OpError>) -> bool
  2. Document the semantics: Add doc comments explaining why StatePushed is not an error

Recommended pattern:

/// StatePushed indicates the operation was re-queued for async processing.
/// This is a normal control flow signal, not an error condition.
fn is_operation_requeued(err: &OpError) -> bool {
    matches!(err, OpError::StatePushed)
}
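
A call site using that helper could then read, for example (hypothetical shape, not code from this PR):

fn handle_op_result(result: Result<(), OpError>) -> Result<(), OpError> {
    match result {
        Ok(()) => Ok(()),
        // Re-queued for async processing: normal control flow, not an error.
        Err(err) if is_operation_requeued(&err) => Ok(()),
        Err(err) => {
            tracing::error!(%err, "operation failed");
            Err(err)
        }
    }
}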

🟢 Code Quality Improvements

5. Test infrastructure: Global RNG pattern could be improved (apps/freenet-ping/app/tests/common/mod.rs:32)

Current approach:

static COMPILE_LOCK: LazyLock<Mutex<()>> = LazyLock::new(|| Mutex::new(()));

The COMPILE_LOCK is good for preventing concurrent compilation races. However, the RNG seeding approach using test name hashing (lines 66-71) has a subtle issue:

// Current code:
let mut hasher = std::collections::hash_map::DefaultHasher::new();
data_dir_suffix.hash(&mut hasher);
let seed = hasher.finish();
let mut rng = rand::rngs::StdRng::seed_from_u64(seed);

Issue: DefaultHasher is not guaranteed to be stable across Rust versions (it's explicitly unstable per docs). Test reproducibility could break on Rust updates.

Fix:

use std::hash::{Hash, Hasher};

// Use a stable hasher for cross-version reproducibility:
fn stable_hash(s: &str) -> u64 {
    use std::collections::hash_map::DefaultHasher; // Still use this, but document the risk
    let mut hasher = DefaultHasher::new();
    s.hash(&mut hasher);
    hasher.finish()
}

// OR use a more explicit approach:
fn seed_from_string(s: &str) -> u64 {
    // Simple but stable: XOR bytes together
    s.bytes().fold(0u64, |acc, b| acc.rotate_left(7) ^ (b as u64))
}

6. Missing timeout cleanup in subscribe wait loop (crates/core/src/operations/subscribe.rs:25-27)

Constants defined:

const LOCAL_FETCH_TIMEOUT_MS: u64 = 1_500;
const LOCAL_FETCH_POLL_INTERVAL_MS: u64 = 25;
const CONTRACT_WAIT_TIMEOUT_MS: u64 = 2_000;

Observation: CONTRACT_WAIT_TIMEOUT_MS (2 seconds) is only used in one place. Good! But consider adding a comment explaining the rationale:

/// Timeout for waiting on contract storage notification.
/// Used when a subscription arrives before the contract has been propagated via PUT.
/// 
/// This timeout should be:
/// - Long enough to allow PUT propagation in normal network conditions (~1-2s)
/// - Short enough to fail fast if PUT never completes (~2s max)
const CONTRACT_WAIT_TIMEOUT_MS: u64 = 2_000;

📋 Minor Observations

7. PUT operation: Notification happens after state merge (crates/core/src/operations/put.rs:1437-1439)

This is actually correct design, but worth highlighting:

Ok(ContractHandlerEvent::PutResponse {
    new_value: Ok(new_val),
}) => {
    // Notify any waiters that this contract has been stored
    op_manager.notify_contract_stored(&key);  // ← After successful storage
    Ok(new_val)
}

Good because:

  • Only notifies after successful contract handler processing
  • Prevents race where Subscribe gets notified before state is actually stored
  • Follows "don't notify until work is done" principle

Could improve: Add a comment explaining this ordering is intentional.


8. Test parameterization is excellent (apps/freenet-ping/app/tests/run_app_blocked_peers.rs)

The BlockedPeersConfig struct and parameterized test approach is really well done:

struct BlockedPeersConfig {
    test_name: &'static str,
    initial_wait: Duration,
    operation_timeout: Duration,
    // ... other params
    check_interval: Option<Duration>,
}

Strengths:

  • Clean separation of test logic from configuration
  • Easy to add new test variants
  • Good documentation in file header

Tiny improvement: Consider adding #[non_exhaustive] to allow future fields without breaking tests:

#[non_exhaustive]
struct BlockedPeersConfig {
    // ...
}
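
For illustration, a variant could then be declared like this, assuming only the fields shown above exist and that a shared driver function (here called run_blocked_peers_test, a hypothetical name) executes the scenario:

use std::time::Duration;

#[tokio::test]
async fn run_app_blocked_peers_baseline() {
    let config = BlockedPeersConfig {
        test_name: "run_app_blocked_peers_baseline",
        initial_wait: Duration::from_secs(10),      // illustrative value
        operation_timeout: Duration::from_secs(60), // illustrative value
        check_interval: Some(Duration::from_millis(500)),
    };
    run_blocked_peers_test(config).await; // hypothetical shared driver
}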

🎓 Architecture Review

Overall Design: Solid ✅

The channel-based notification system is the right pattern for this problem:

  1. Decoupling: PUT doesn't need to know about Subscribe waiters
  2. Efficiency: No busy-waiting/polling
  3. Bounded: Uses oneshot channels (can't accumulate backpressure)

Alternatives considered (likely rejected for good reasons):

  • Event bus pattern: More complex, harder to debug
  • Polling approach: Wastes CPU, higher latency
  • Futures/Promise pattern: Similar to current approach but less explicit

The current approach wins because:

  • Clear ownership model (OpManager owns the waiters HashMap)
  • Easy to reason about (register → wait → notify)
  • Good failure modes (timeout on channel drop)

📝 Recommendations Summary

Must Fix (before merge):

  1. ✅ Address race condition in notify_contract_stored (add contract existence check)
  2. ✅ Fix resource leak in Subscribe timeout path (add cleanup)

Should Fix (before merge or follow-up PR):

  3. Add defensive assertions for subscription listener registration
  4. Centralize the OpError::StatePushed handling pattern

Nice to Have (follow-up PR okay):

  5. Use stable hashing for test RNG seeding
  6. Add explanatory comments for timeout values and notification ordering
  7. Add #[non_exhaustive] to the test config struct


🧪 Testing Assessment

The test infrastructure improvements are excellent:

  • ✅ Per-test RNG isolation prevents cross-test interference
  • ✅ COMPILE_LOCK prevents concurrent contract compilation races
  • ✅ Parameterized test framework enables multiple test variants
  • ✅ Added check_interval for better WebSocket backpressure handling

Test coverage appears comprehensive for the blocked peers scenario. The fix allows the previously #[ignore]d test to pass consistently.


🏁 Conclusion

This is quality work that addresses a real concurrency issue. The core architectural decision (channel-based notifications) is sound. The main concerns are:

  1. Race condition in notification cleanup (high priority)
  2. Resource leak on timeout (medium priority)

Once these are addressed, this PR should be safe to merge. The test improvements alone are valuable for preventing future flakiness.

Recommendation: Request changes for issues #1 and #2, then approve after fixes.



@netsirius (Collaborator, Author) commented:

@claude the improvements from the last review have been implemented, you can review it again now

claude bot commented Dec 6, 2025

Claude finished @netsirius's task


Re-review Complete ✅

All critical issues from the previous review have been successfully addressed. The implementation shows excellent understanding of the race conditions and proper async Rust patterns.


Summary of Fixes

✅ Critical Issue #1: Race Condition in notify_contract_stored - FIXED

Previous Problem: Race between wait_for_contract creating a receiver and notify_contract_stored removing the HashMap entry could cause Subscribe to wait forever.

Solution Implemented: New wait_for_contract_with_timeout helper (subscribe.rs:38-64) with a double-check pattern:

async fn wait_for_contract_with_timeout(
    op_manager: &OpManager,
    key: ContractKey,
    timeout_ms: u64,
) -> Result<bool, OpError> {
    // 1. Fast path - contract already exists
    if super::has_contract(op_manager, key).await? {
        return Ok(true);
    }

    // 2. Register waiter BEFORE second check to avoid race condition
    let notifier = op_manager.wait_for_contract(key);

    // 3. Check again - contract may have arrived between first check and registration
    if super::has_contract(op_manager, key).await? {
        return Ok(true);
    }

    // 4. Wait for notification or timeout
    tokio::select! {
        _ = notifier => {}
        _ = sleep(Duration::from_millis(timeout_ms)) => {}
    };

    // 5. Always verify actual state - don't trust notification alone
    super::has_contract(op_manager, key).await
}

Why This Works:

  • Pre-registration check (step 1): Avoids registration if contract already exists (fast path)
  • Register then re-check (steps 2-3): If contract arrives after step 1 but before registration, step 3 catches it
  • Trust but verify (step 5): Final state verification handles timeout vs. notification ambiguity cleanly

Race condition eliminated: The second check (step 3) happens AFTER registration, so if a contract arrives:

  • Before registration → Caught by step 3
  • After registration → Caught by notification (step 4)
  • No gap exists where the contract can arrive undetected

✅ Critical Issue #2: Resource Leak on Subscribe Timeout - RESOLVED

Previous Problem: Oneshot senders remained in contract_waiters HashMap when Subscribe timed out, causing memory leak.

Solution Implemented: The wait_for_contract_with_timeout design automatically prevents this leak through Rust's ownership semantics:

  1. Automatic cleanup via tokio::select!: When timeout fires, the notifier receiver is dropped
  2. Sender failure is harmless: When notify_contract_stored tries to send to a dropped receiver, sender.send(()) returns Err, which is explicitly ignored (op_state_manager.rs:681)
  3. HashMap cleanup happens on notify: waiters.remove(key) (op_state_manager.rs:677) removes the entire entry including stale senders

Why This Works:

// In op_state_manager.rs:675-691
pub fn notify_contract_stored(&self, key: &ContractKey) {
    let mut waiters = self.contract_waiters.lock();
    if let Some(senders) = waiters.remove(key) {  // ← Removes all senders for this key
        let count = senders.len();
        for sender in senders {
            // Ignore errors if receiver was dropped (e.g., operation timed out)
            let _ = sender.send(());  // ← Gracefully handles dropped receivers
        }
        // ... logging ...
    }
}

Documentation added (op_state_manager.rs:672-674):

"Note: Stale waiters (from timed-out operations) are automatically cleaned up here when we remove all senders for the key. The send() will fail silently for dropped receivers, which is harmless."

Result: No explicit cleanup needed in timeout path. The design is self-healing—stale senders get cleaned up on the next notification for that key, or when the HashMap entry is eventually removed.


Medium Priority Issues - Status

✅ Issue #3: Subscription Listener Registration - IMPROVED

The register_subscription_listener helper (client_events/mod.rs:206-245) is now used consistently across all code paths. Good centralization and error handling.

Observation: All subscription paths properly register listeners. The helper function provides:

  • Centralized logic for registration
  • Clear error handling with detailed logging
  • Consistent operation_type parameter for debugging context

No issues found here.


✅ Issue #4: StatePushed Error Handling - PROPERLY HANDLED

OpError::StatePushed is now treated as a control flow signal (operation re-queued), not an error:

node/mod.rs:1144-1146:

Err(OpError::StatePushed) => {
    return Ok(None);  // Early return without error logging
}

node/mod.rs:1244:

if !matches!(err, OpError::StatePushed) {
    return Err(err);
}

client_events/mod.rs:780:

match update::request_update(&op_manager, op).await {
    Ok(()) | Err(OpError::StatePushed) => {}  // Both treated as success
    Err(err) => { /* handle actual errors */ }
}

Assessment: Consistent pattern across all locations. The approach is clear: StatePushed signals async re-queuing and should not be logged as an error.

Minor suggestion (optional follow-up): Consider adding a doc comment to OpError::StatePushed in the enum definition explaining this semantic for future maintainers.


Minor Issues - Status

🟡 Issue #5: Test RNG Stability - ACKNOWLEDGED

The current implementation (common/mod.rs:66-71) uses DefaultHasher for seeding:

let mut hasher = std::collections::hash_map::DefaultHasher::new();
data_dir_suffix.hash(&mut hasher);
let seed = hasher.finish();
let mut rng = rand::rngs::StdRng::seed_from_u64(seed);

Concern from previous review: DefaultHasher is not guaranteed stable across Rust versions.

Assessment: This is a low-priority concern:

  • Tests are isolated per-test (which was the main goal) ✅
  • COMPILE_LOCK prevents race conditions ✅
  • Hash instability would only affect cross-version test reproducibility (not correctness)
  • If hash changes, tests would use different seeds but still pass

Recommendation: Document this limitation with a comment, or defer to a follow-up PR:

// Note: DefaultHasher is not stable across Rust versions, so test runs
// may produce different random sequences after Rust updates. This is acceptable
// as we only need per-test isolation, not cross-version reproducibility.

New Code Review - No Issues Found

I reviewed the new wait_for_contract_with_timeout implementation thoroughly and found no new bugs or issues. The design is:

  • Race-free: Double-check pattern eliminates all timing windows
  • Leak-free: Rust ownership + explicit ignore of send errors prevents accumulation
  • Timeout-safe: tokio::select! provides bounded waiting
  • Correct: Final state verification ensures contract availability
  • Well-documented: Comments explain the "why" of each step


Architecture Assessment

The notification system architecture is sound and well-implemented:

  1. Separation of concerns:

    • OpManager owns the contract_waiters HashMap
    • wait_for_contract_with_timeout encapsulates the waiting logic
    • notify_contract_stored is called at the right place (after successful PUT)
  2. Failure modes are graceful:

    • Timeout → Final verification determines actual state
    • Notification arrives but contract missing → Final verification catches this
    • Stale senders → Ignored via let _ = sender.send(())
  3. Performance characteristics:

    • Fast path when contract exists (no allocation)
    • Bounded resources (oneshot channels, not mpsc)
    • Efficient cleanup (remove entire HashMap entry on notify)

No concerns with the overall design.


Testing Assessment

The test improvements from the PR description remain solid:

  • Per-test RNG isolation prevents cross-test interference
  • COMPILE_LOCK prevents concurrent contract compilation races
  • Parameterized test framework (BlockedPeersConfig) is well-designed
  • WebSocket backpressure handling via check_interval parameter

The previously #[ignore]d test (run_app_blocked_peers.rs) now passes consistently, validating the fixes.


Final Verdict

All critical and medium-priority issues have been resolved. The implementation demonstrates:

  • Deep understanding of async Rust and race conditions
  • Proper use of Rust ownership to prevent resource leaks
  • Clean separation of concerns
  • Excellent documentation explaining the "why"

Recommendations

Before Merge:

  • ✅ All critical issues resolved—ready to merge

Optional Follow-ups (low priority):

  1. Add doc comment to OpError::StatePushed explaining its control-flow semantics
  2. Document DefaultHasher stability caveat in test RNG code
  3. Consider extracting wait_for_contract_with_timeout pattern into a reusable utility if this pattern appears elsewhere

Conclusion

This PR is approved for merge. 🎉

The race condition fixes are correct and complete. The code quality is high, the architecture is sound, and the test improvements are valuable. The notification system elegantly handles the PUT-before-Subscribe race without introducing new issues.

Great work addressing all the feedback! The double-check pattern in wait_for_contract_with_timeout is a textbook solution to this class of race condition.


@iduartgomez (Collaborator) commented:

After the merge of #2229, do some of the changes need to change?

@claude review

claude bot commented Dec 6, 2025

Claude finished @iduartgomez's task


Review: Impact of PR #2229 merge

I've analyzed the recently merged PR #2229 and its interaction with this PR #2233.


Summary: No Changes Needed ✅

Good news: This PR is already based on PR #2229 (commit f76de102) and properly integrates all of its changes. No conflicts or adjustments are required.


Detailed Analysis

1. Branch Relationship

Looking at the commit history:

a4186370 (HEAD) fix clippy error
127b02ba improve contract notification handling...
...
0a08ba3d rebase changes  ← Branch starts here
f76de102 (origin/main) fix: separate network from local subscriptions (#2229)

This PR already includes PR #2229 - the branch was rebased on top of it via commit 0a08ba3d.


2. Architectural Compatibility

PR #2229 introduced an important architectural change:

  • Separated network subscriptions from local client subscriptions
  • Removed add_subscriber() call from complete_local_subscription()
  • Local updates now flow through executor's update_notifications channel

This PR (2233) builds correctly on that foundation:

Local Subscriptions (own WebSocket clients)
// crates/core/src/operations/subscribe.rs:332-356
async fn complete_local_subscription(
    op_manager: &OpManager,
    id: Transaction,
    key: ContractKey,
) -> Result<(), OpError> {
    // ✅ Correctly does NOT call add_subscriber
    // ✅ Notifies via NodeEvent::LocalSubscribeComplete
    // ✅ Follows PR #2229's architecture
    op_manager
        .notify_node_event(crate::message::NodeEvent::LocalSubscribeComplete {
            tx: id,
            key,
            subscribed: true,
        })
        .await?;
    
    op_manager.completed(id);
    Ok(())
}
Network Subscriptions (remote peers)
// crates/core/src/operations/subscribe.rs:515-518
// When handling RequestSub from REMOTE peer:
if op_manager
    .ring
    .add_subscriber(key, subscriber.clone(), None)  // ✅ Correct for network subs
    .is_err()

The distinction is correct:

  • complete_local_subscription() → Own WebSocket clients → No add_subscriber
  • RequestSub handler → Remote network peers → Still uses add_subscriber

3. New Functionality Integration

This PR adds race condition handling that complements PR #2229:

wait_for_contract_with_timeout (NEW in this PR)
// Handles race where Subscribe arrives before PUT completes
async fn wait_for_contract_with_timeout(
    op_manager: &OpManager,
    key: ContractKey,
    timeout_ms: u64,
) -> Result<bool, OpError> {
    // Fast path check → Register waiter → Double-check → Wait → Verify
    // ...
}

This works seamlessly with PR #2229's changes because:

  • It only affects the timing of when subscriptions complete
  • It doesn't change the local vs. network subscription separation
  • The notification system (wait_for_contract, notify_contract_stored) is orthogonal to #2229's changes

4. Files Modified

Comparing the two PRs:

PR #2229 touched:

  • crates/core/src/operations/subscribe.rs (architectural change)
  • crates/core/src/operations/update.rs
  • crates/core/src/ring/seeding.rs
  • crates/core/src/contract/executor/runtime.rs
  • crates/core/src/message.rs
  • crates/core/tests/operations.rs

This PR touches:

  • crates/core/src/operations/subscribe.rs (builds on #2229's foundation)
  • crates/core/src/operations/put.rs (notification system)
  • crates/core/src/client_events/mod.rs (subscription registration)
  • crates/core/src/node/mod.rs (StatePushed handling)
  • crates/core/src/node/op_state_manager.rs (contract waiters)
  • Test infrastructure files

Overlap: Only subscribe.rs - and the changes are complementary, not conflicting.


Verification: No Merge Conflicts

$ git status
On branch fix/2106-ensure-op-atomicity
Your branch is up to date with 'origin/fix/2106-ensure-op-atomicity'.
nothing to commit, working tree clean

The branch is clean and based on the current main (which includes #2229).


Conclusion

✅ No changes needed. This PR:

  1. Already incorporates PR #2229 via rebase (commit 0a08ba3d)
  2. Respects the architectural separation between network and local subscriptions
  3. Adds complementary functionality (race condition handling) without conflicts
  4. No merge conflicts exist

Recommendation: This PR is ready to merge without modifications related to PR #2229.


Additional Notes

The changes in this PR actually strengthen the architectural improvements from PR #2229 by:

  • Handling the PUT-before-Subscribe race condition
  • Providing graceful fallbacks when no remote peers are available
  • Adding proper timeout handling for contract propagation

The two PRs work together as a coherent solution to subscription reliability issues.


@iduartgomez iduartgomez added this pull request to the merge queue Dec 6, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 6, 2025
@sanity sanity enabled auto-merge December 6, 2025 18:39
sanity and others added 3 commits December 6, 2025 13:17
Two separate issues:

1. test_macro_example flaky tests:
   The duplicate keypair detection used Display impl of TransportPublicKey,
   which only captured 12 bytes of DER encoding. Analysis showed:
   - First 6 bytes: DER headers (identical for all 2048-bit RSA keys)
   - Last 6 bytes: Contains exponent 65537 (identical for all keys)
   - Only ~1 byte of actual entropy

   Fix: Use full public key comparison (via Hash impl) for uniqueness
   Also improved Display impl to use bytes from the modulus (actual random data)

2. transport_perf benchmark clippy errors (from merge):
   - explicit_counter_loop: Use for loop with range instead of manual counter
   - while_let_loop: Convert loop+match to while let

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…omicity

# Conflicts:
#	crates/core/src/node/op_state_manager.rs
@sanity sanity added this pull request to the merge queue Dec 6, 2025
Merged via the queue into main with commit 9f99030 Dec 6, 2025
14 of 15 checks passed
@sanity sanity deleted the fix/2106-ensure-op-atomicity branch December 6, 2025 20:53
Development

Successfully merging this pull request may close these issues.

PUT operations with auto-subscribe children don't deliver completion to client

4 participants