feat(ring): redesign seeding logic with LRU byte-budget cache #2232
Conversation
Introduce `SeedingCache`, a new data structure for contract caching that replaces the ad-hoc count-based and distance-threshold logic in `SeedingManager`.

Key design principles:

- **Resource-aware:** Uses a byte budget, not contract count
- **Demand-driven:** Evicts based on LRU access patterns
- **Manipulation-resistant:** Only GET/PUT/SUBSCRIBE refresh position, not UPDATE (which contract creators control)

This is the foundation for issue #2231. The cache is not yet integrated with the rest of the system - that will come in subsequent commits.

Includes comprehensive unit tests for:

- Basic add/remove/contains operations
- LRU eviction order
- Byte budget enforcement
- Large contracts displacing multiple small ones
- Access refreshing LRU position
- Edge cases (zero budget, oversized contracts)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@claude Please do a critical review of this PR
Claude finished @sanity's task.

**Critical Review of PR #2232**

**Overview**

This PR introduces a principled LRU byte-budget cache to replace the current ad-hoc seeding logic. The design is well-motivated and the implementation is solid as a foundation. However, there are several critical issues that need addressing before integration.

🔴 **Critical Issues**

**1. Incorrect LRU Eviction Budget Check** (crates/core/src/ring/seeding_cache.rs:117)

The eviction loop has a subtle but serious bug:

```rust
while self.current_bytes + size_bytes > self.budget_bytes && !self.lru_order.is_empty()
```

Problem: This allows the cache to exceed its budget when adding contracts larger than the budget. Scenario:
This is intentional per the test at line 418 (…).

**Recommendation:** Option 3 is best. Add a (…)

**2. Zero-Budget Edge Case is Broken** (crates/core/src/ring/seeding_cache.rs:390-402)

The test asserts:

```rust
assert!(cache.is_empty() || cache.current_bytes() == 100);
```

This test accepts either outcome, which is a red flag. With zero budget:
**Fix:** Add an early return when the contract is larger than the entire budget:

```rust
pub fn record_access(...) -> Vec<ContractKey> {
    // Early reject for oversized contracts
    if size_bytes > self.budget_bytes {
        return vec![]; // Or return an error type
    }
    // ... rest of implementation
}
```

**3. Missing Integration with Access Operations**

The PR description states "GET/PUT/SUBSCRIBE refresh position", but there's no integration code showing where these operations call `record_access`.

**Recommendation:** Add integration points, or at least stub comments showing where the hooks will go:

```rust
// In get_op.rs (example):
// cache.record_access(key, state.len() as u64, AccessType::Get);
```

**4. No Eviction Callback Mechanism**

When contracts are evicted, the caller receives the evicted keys as a return value; nothing forces the caller to act on them.

**Recommendation:** Add an eviction callback:

```rust
pub struct SeedingCache<T: TimeSource, F: FnMut(ContractKey)> {
    // ...
    on_evict: Option<F>,
}

// In eviction:
if let Some(ref mut callback) = self.on_evict {
    callback(oldest_key);
}
```

Or document that the caller MUST process the returned eviction list.
Pull request overview
This PR introduces a new LRU-based seeding cache with byte-budget awareness as a foundation for replacing the current ad-hoc seeding logic. The design shifts from magic-number thresholds to a principled, demand-driven approach where storage limits are byte-based and contracts are retained based on actual access patterns (GET/PUT/SUBSCRIBE).
Key Changes
- Implements `SeedingCache<T: TimeSource>` with byte-budget aware LRU eviction
- Introduces `AccessType` enum to distinguish manipulation-resistant operations (GET/PUT/SUBSCRIBE) from UPDATE operations
- Provides comprehensive test coverage (12 unit tests) for cache operations, eviction logic, and edge cases
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| crates/core/src/ring/seeding_cache.rs | New module implementing byte-budget LRU cache with access tracking, eviction logic, and comprehensive unit tests |
| crates/core/src/ring/mod.rs | Adds seeding_cache module to the ring module |
```rust
if let Some(existing) = self.contracts.get_mut(&key) {
    // Already cached - refresh position
    existing.last_accessed = now;
    existing.last_access_type = access_type;

    // Move to back of LRU (most recently used)
    self.lru_order.retain(|k| k != &key);
    self.lru_order.push_back(key);
```
Copilot (AI), Dec 6, 2025
When refreshing an existing contract (line 106), the code doesn't handle the case where size_bytes has changed. Contract state can grow or shrink over time (e.g., through PUT operations), but the existing contract's size_bytes field and current_bytes tracking are not updated to reflect the new size.
This will cause incorrect byte budget accounting. For example:
- Contract initially added with 100 bytes
- Contract state grows to 200 bytes
- `record_access(key, 200, AccessType::Put)` is called
- The cached `size_bytes` remains 100, and `current_bytes` is not adjusted
- The cache now underestimates its memory usage by 100 bytes
Recommendation: Update the existing contract's size and adjust current_bytes accordingly:
```rust
if let Some(existing) = self.contracts.get_mut(&key) {
    // Update byte accounting if size changed
    let size_delta = size_bytes as i64 - existing.size_bytes as i64;
    self.current_bytes = if size_delta >= 0 {
        self.current_bytes + size_delta as u64
    } else {
        self.current_bytes.saturating_sub((-size_delta) as u64)
    };
    existing.size_bytes = size_bytes;
    existing.last_accessed = now;
    existing.last_access_type = access_type;

    // Move to back of LRU (most recently used)
    self.lru_order.retain(|k| k != &key);
    self.lru_order.push_back(key);
}
```

Additionally, consider whether size increases should trigger eviction if the new size would exceed the budget.
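The signed-delta accounting suggested above can be checked in isolation; in this sketch `apply_size_change` mirrors the delta logic, with the surrounding struct omitted for brevity:

```rust
// Standalone check of the signed size-delta accounting: given the cache's
// current total, the entry's old size, and its new size, return the updated
// total. Mirrors the suggested snippet, not the PR's actual code.
fn apply_size_change(current_bytes: u64, old_size: u64, new_size: u64) -> u64 {
    let size_delta = new_size as i64 - old_size as i64;
    if size_delta >= 0 {
        current_bytes + size_delta as u64
    } else {
        current_bytes.saturating_sub((-size_delta) as u64)
    }
}

fn main() {
    // Grow: a 100-byte entry becomes 200 bytes while the cache holds 300 total
    assert_eq!(apply_size_change(300, 100, 200), 400);
    // Shrink: a 200-byte entry becomes 150 bytes
    assert_eq!(apply_size_change(400, 200, 150), 350);
    // Unchanged size is a no-op
    assert_eq!(apply_size_change(350, 150, 150), 350);
    println!("ok");
}
```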
```rust
// With zero budget, any contract should be immediately evicted
// Actually, this means we can never cache anything
let evicted = cache.record_access(key, 100, AccessType::Get);

// With zero budget, the contract is added but we can't evict anything
// (nothing to evict). This is a pathological edge case.
```
The test comment is confusing and contradicts the assertion. The comment says "With zero budget, any contract should be immediately evicted" and "this means we can never cache anything", but then the assertion allows for `cache.current_bytes() == 100`, meaning the contract was actually cached.

Looking at the `record_access` implementation, with a zero budget the while loop condition `self.current_bytes + size_bytes > self.budget_bytes` (0 + 100 > 0) is true, but `!self.lru_order.is_empty()` is false (nothing to evict), so the loop doesn't execute and the contract is added.
Recommendation: Update the comment to accurately describe the actual behavior:

```rust
// With zero budget, the contract is added anyway because there's nothing
// to evict. This is a pathological edge case - the cache allows exceeding
// the budget when there are no contracts to evict.
let evicted = cache.record_access(key, 100, AccessType::Get);
```
```rust
// With zero budget, the contract is added but we can't evict anything
// (nothing to evict). This is a pathological edge case.
```
The comment on lines 398-399 is also confusing and doesn't accurately describe the code behavior. The comment says "we can't evict anything (nothing to evict)" but this applies to any empty cache, not specifically the zero budget case. The more significant issue is that the cache allows adding a contract that exceeds the budget when there's nothing to evict.
Recommendation: Remove or update this comment to focus on the actual edge case behavior (contracts added despite exceeding budget when cache is empty).
Suggested replacement:

```rust
// Edge case: When the cache is empty (including zero budget), a contract larger than the budget
// can still be added because there is nothing to evict. This test documents that behavior.
```
```rust
#[test]
fn test_refresh_existing_contract() {
    let time_source = MockTimeSource::new(Instant::now());
    let mut cache = SeedingCache::new(1000, time_source.clone());
    let key = make_key(1);

    // First access
    cache.record_access(key, 100, AccessType::Get);
    let first_access = cache.get(&key).unwrap().last_accessed;

    // Advance time and access again
    let mut new_time_source = time_source.clone();
    new_time_source.advance_time(Duration::from_secs(10));
    cache.time_source = new_time_source;

    cache.record_access(key, 100, AccessType::Put);

    // Should still be one contract, but updated
    assert_eq!(cache.len(), 1);
    assert_eq!(cache.current_bytes(), 100); // Size unchanged

    let info = cache.get(&key).unwrap();
    assert_eq!(info.last_access_type, AccessType::Put);
    assert!(info.last_accessed > first_access);
}
```
Missing test coverage for contract size changes. The `test_refresh_existing_contract` test passes the same size (100 bytes) on both accesses, but doesn't test what happens when a contract's size changes between accesses.
This is important because:
- Contract state can grow/shrink through PUT operations
- The current implementation has a bug where size changes aren't reflected in byte accounting (see related bug comment)
Recommendation: Add a test case like:

```rust
#[test]
fn test_contract_size_change() {
    let mut cache = make_cache(1000);
    let key = make_key(1);

    // Add contract with initial size
    cache.record_access(key, 100, AccessType::Get);
    assert_eq!(cache.current_bytes(), 100);

    // Update with larger size
    cache.record_access(key, 200, AccessType::Put);
    assert_eq!(cache.current_bytes(), 200);

    // Update with smaller size
    cache.record_access(key, 150, AccessType::Put);
    assert_eq!(cache.current_bytes(), 150);
}
```

```rust
self.lru_order.retain(|k| k != &key);
self.lru_order.push_back(key);
```
Using `VecDeque::retain` to move an element to the back of the LRU is O(n), where n is the number of cached contracts. This operation happens on every access to an already-cached contract, which could be a hot path.

For better performance with large caches, consider using a doubly-linked list or maintaining a secondary index (e.g., `HashMap<ContractKey, usize>` mapping to position in `lru_order`). Alternatively, consider a crate like `lru`, which implements this efficiently.

That said, this may be acceptable depending on expected cache sizes. If the cache typically holds dozens of contracts, this is fine. If it could hold thousands, this becomes a bottleneck.

Recommendation: Either:

- Document the performance characteristics and expected cache size
- Consider a more efficient data structure if cache sizes are expected to be large
- Add a TODO to optimize this if it becomes a bottleneck during integration
```rust
};
self.contracts.insert(key, contract);
self.lru_order.push_back(key);
self.current_bytes += size_bytes;
```
Potential integer overflow when adding to `current_bytes` on line 136. If `self.current_bytes + size_bytes` overflows u64, it will wrap around, causing incorrect byte accounting.

While this is unlikely in practice (it would require petabytes of data), defensive programming suggests using checked or saturating arithmetic.

Recommendation: Use saturating arithmetic and handle the pathological case gracefully:

```rust
self.current_bytes = self.current_bytes.saturating_add(size_bytes);
```

This ensures the cache gracefully handles pathological cases rather than silently wrapping around.
```diff
- self.current_bytes += size_bytes;
+ self.current_bytes = self.current_bytes.saturating_add(size_bytes);
```
Addresses review feedback from Claude and Copilot on PR #2232:

1. **Contract size changes now update byte accounting** - When a contract's size changes (e.g., state grows via PUT), the cache now correctly updates `current_bytes` to reflect the new size. Previously, size changes were ignored, causing incorrect accounting.
2. **Use saturating arithmetic for safety** - `current_bytes` now uses `saturating_add` to prevent theoretical overflow in pathological cases (petabytes of data).
3. **Document O(n) LRU performance** - Added a comment explaining that `VecDeque::retain` is O(n); acceptable for typical cache sizes (dozens to hundreds), with a note to consider the `lru` crate if scaling to thousands.
4. **Clarify edge case behavior in tests** - Zero-budget test: clearly document that the budget is a soft limit when the cache is empty (nothing to evict). Over-budget test: explain the design decision to accept oversized contracts rather than making them unfindable.
5. **Add tests for contract size changes** - `test_contract_size_change_increases`: verify byte accounting on grow. `test_contract_size_change_decreases`: verify byte accounting on shrink. `test_contract_size_change_triggers_no_eviction`: document that size changes don't trigger eviction (only new contracts do).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Complete integration of the byte-budget aware LRU seeding cache:

- Replace score-based seeding with LRU access tracking
- Update GET operations to call `record_get_access` with state size
- Update PUT operations to call `seed_contract` with state size
- Update test utilities and test infrastructure
- Remove old `should_seed` logic and magic distance thresholds
- Add eviction handling: clear subscribers when contracts are evicted
- Update tests to use the new API

The seeding decision is now simple: cache contracts accessed via GET/PUT/SUBSCRIBE, and let LRU eviction handle limits when the byte budget is exceeded (default 100MB).

Full upstream unsubscription on eviction (sending Unsubscribed messages) is deferred to #2164 (subscription tree branch pruning), as it requires tracking upstream->downstream relationships.

Also fixes pre-existing clippy warnings in the transport_perf benchmark.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Resolve conflicts:

- Keep the LRU byte-budget `seed_contract(key, size)` API
- Add `announce_contract_cached()` calls from main
- Remove dropped contract notification code (handled internally by SeedingManager)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Resolve conflicts:

- Keep simplified subscription handling (eviction handled internally)
- Keep comment in the transport_perf benchmark

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The subscription request in the `BroadcastTo` state was incorrectly using `start_subscription_request_internal` with `track_parent=false`, which meant the parent operation wasn't being tracked. This caused subscribe operations to have `state=None` when receiving `ReturnSub` messages, triggering invalid_transition errors.

Changed to use `start_subscription_request`, which properly tracks the parent-child relationship.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Problem
The current seeding logic in `SeedingManager::should_seed()` uses ad-hoc thresholds that don't reflect principled design. The three-tier decision tree treats all contracts equally regardless of size, doesn't consider actual demand, and conflates storage limits with proximity requirements.
Why This Matters
Design Discussion
This PR emerged from a design discussion (see issue #2231) about how seeding should work from first principles:
Core insight: Proximity emerges naturally from routing - peers near a contract's location see more GETs, keeping it fresh in their caches. We don't need explicit distance thresholds.
Key properties of the new design:
This PR
What's Included
- `SeedingCache<T: TimeSource>` struct in `crates/core/src/ring/seeding_cache.rs`
- `AccessType` enum (Get, Put, Subscribe) for manipulation-resistant refresh
- Removal of the old `should_seed` logic and magic thresholds

Integration Changes
- GET operations call `record_get_access(key, size_bytes)` instead of `should_seed` + `seed_contract`
- PUT operations call `seed_contract(key, size_bytes)` with state size
- Updated `MockRing` to match the new API
- `SeedingCache` instead of `DashMap<ContractKey, Score>`

What's Deferred to #2164
- Sending `Unsubscribed` to upstream peers requires tracking upstream->downstream relationships per contract. This is documented with a TODO comment.

Testing
All 28 seeding-related tests pass:
- `ring::seeding_cache::*` - LRU cache behavior
- `ring::seeding::*` - SeedingManager integration
- `operations::test_utils::*` - Mock ring behavior

Related
[AI-assisted - Claude]