Skip to content

feat: dutydb#415

Open
mskrzypkows wants to merge 36 commits into
mainfrom
dutydb
Open

feat: dutydb#415
mskrzypkows wants to merge 36 commits into
mainfrom
dutydb

Conversation

@mskrzypkows
Copy link
Copy Markdown
Contributor

No description provided.

@mskrzypkows
Copy link
Copy Markdown
Contributor Author

@claude review

@claude
Copy link
Copy Markdown

claude Bot commented May 14, 2026

Claude finished @mskrzypkows's task in 4m 38s —— View job


Review — Pluto DutyDB (PR #415)

Overall: a solid, readable port of Charon's core/dutydb/memory.go. The high‑level structure, error variants and field layout match the Go source well, and the Notify+RwLock swap for Go's Mutex+query‑queue model is a sensible Rust idiom that I'd lean into. A few correctness / parity points are worth addressing before merging, and test coverage has some real gaps versus the Go suite. Comments grouped by severity.


1. Correctness / parity with Charon

(a) deadliner.add is awaited while holding the state write lock
memory.rs:219-224

pub async fn store(&self, duty: Duty, unsigned_set: UnsignedDataSet) -> Result<()> {
    let mut state = self.state.write().await;
    if !self.deadliner.add(duty.clone()).await {
        return Err(Error::ExpiredDuty);
    }

Go's MemDB.Store calls db.deadliner.Add(duty) synchronously under a sync.Mutex, so the lock is held for a non‑blocking instant. In Rust, Deadliner::add is async fn and the concrete Deadliner (crates/core/src/deadline.rs:258) does its own locking and channel send (tx.send(...).await). Holding the dutydb write lock across that .await blocks all readers/writers of the DutyDB on the deadliner's progress, which is a deadlock/contention foot‑gun (and worse than Go).

Recommend either:

  • call deadliner.add(duty.clone()).await before acquiring the write lock, then acquire and proceed; or
  • make Deadliner::add non‑async (Go's signature is sync).

(b) Dead clash check in store_agg_attestation
memory.rs:510-527

The key includes the tree‑hash root of the att data, and the "clash" check compares existing_data.tree_hash_root().0 != root. Because we just looked the entry up by that root, the two sides are always equal, so ClashingDataRoot is unreachable. This faithfully mirrors a bug already in memory.go:458-460, but worth flagging: this branch never fires and should either (i) be removed, or (ii) the key changed (e.g. slot only) so a real clash is detectable. Either is a behavior change vs. Go and warrants a separate decision.

(c) Aggregated attestation duty is always re‑inserted even when already present
memory.rs:521-524

} else {
    self.agg_keys_by_slot.entry(slot).or_default().push(key);
}
self.agg_duties.insert(key, agg.clone());

Matches Go (memory.go:462), but the .clone() on every duplicate store is wasted work — consider moving the insert into the else branch if Go's overwrite is not actually load‑bearing.

(d) await_* have no per‑call cancellation
memory.rs:289-307

Go's AwaitProposal(ctx, slot) returns ctx.Err() if the caller cancels. The Rust versions only honor the DB‑global CancellationToken. Callers can tokio::time::timeout(...) or abort the task to drop the future, but the API surface is narrower than Go's. I'd add a CancellationToken (or a tokio::time::Duration timeout) parameter to bring parity — otherwise it's easy for higher layers to leak tasks waiting forever.

(e) Expired duties are not drained when store returns early with an error
memory.rs:226-283

Same as Go (early return err in the switch skips the drain loop), so this is parity, but worth noting because a transient clash error will defer expiry processing until the next successful store. Not blocking.

(f) BuilderProposer and other unsupported duty types are still registered with the deadliner

deadliner.add(duty) runs before the switch, so a duty whose type later returns DeprecatedDutyBuilderProposer / UnsupportedDutyType ends up tracked by the deadliner and silently expires (delete_duty rejects unknown types again). Again, this is parity with Go, but it's a minor wart that has been carried over.


2. Concurrency model

(a) await_data thundering herd. memory.rs:341-367

notify_waiters() wakes every waiter on the channel, regardless of whether their specific key was the one stored. Each woken task then takes the read lock and does a hashmap lookup. For a cluster with many concurrent await_attestation calls in flight (one per committee), every Store will wake them all. Go's *Queries slice resolves only the entry whose key matches, with no spurious wakeups. Probably fine in practice given the scale, but worth being aware of — if it shows up as latency, a per‑key channel or a tokio::sync::watch keyed map is a clean fix.

(b) Notify::notified() registration pattern is correct. The pin! + enable() + lookup + select! ordering avoids the classic missed‑wakeup race. Nice.

(c) State.deadliner_rx: Option<...> is held inside State so that try_recv happens under the write lock. The Option/take() workaround stems from Deadliner::c() returning the receiver only once. It would read more naturally as a dedicated Mutex<Option<Receiver<Duty>>> field on MemDB (or simply on State) — keeping it in State works but the optional is awkward.

(d) MemDB::shutdown(&self) is idempotent via CancellationToken, which is an improvement over Go's "may only be called once" requirement. Worth mentioning in the doc comment.


3. API / style

  • Naming collision. pluto_core::dutydb::MemDB and pluto_core::parsigdb::memory::MemDB are both reachable. Not a compile error, but consider DutyMemDB / ParSigMemDB to make imports unambiguous.
  • State methods are sync but mutate &mut self — they implicitly assume the caller holds the write lock. Go names them *Unsafe for exactly this reason; a one‑line // caller holds the write lock doc on each method (or splitting State into its own private module) would help future readers.
  • Doc comment on the commIdx=0 compatibility block memory.rs:466-468 is shorter than Go's. The Go version explains why both index variants are stored (post‑Electra default + buggy VCs); the Rust comment is a one‑liner. Worth porting the full rationale — this code looks pointless without the context.
  • UnsignedDutyData::Proposal(Box<VersionedProposal>) — good call to box the variant; the other variants are much smaller.
  • store_agg_attestation / store_sync_contribution differ in shape: agg writes to agg_keys_by_slot inside the else and to agg_duties outside; contrib writes both inside the else. Pick one style — the contrib version is the safer one (no overwrite when content matches by key).

4. Tests

The new suite covers shutdown, attester happy path, unsupported duty types, proposer, sync contribution, clashes for sync/blocks/proposer, and expiry. Compared to charon/core/dutydb/memory_test.go, the following are missing — I'd add at least the first three before merge:

  • Aggregator end‑to‑end test (Go's TestMemDBAggregator) — await_agg_attestation and store_agg_attestation are completely uncovered.
  • ClashingPublicKey — two different pubkeys for the same (slot, commIdx, valIdx).
  • ClashingAttestationData (and the commIdx=0 source / target variants) — different phase0::AttestationData for the same (slot, commIdx).
  • Await blocks then resolves: the mem_db_proposer test does this for proposals; consider replicating for attester / aggregator / sync.
  • Shutdown wakes an already‑waiting await_* — current tests call shutdown before the await; the more interesting case is a waiter parked in select!.
  • The clashing_sync_contributions test mutates signature to force a different tree‑hash root — it would be more explicit to construct two semantically distinct contributions, since "signature affects root" is only true today and depends on the SSZ schema.

The TestDeadliner helper uses std::sync::Mutex inside async fn add (memory.rs:651-655). The critical section is trivially short so this is fine, but parking_lot::Mutex or restructuring would make it impossible to misuse later.


5. Nits

  • memory.rs:155-170: missing doc comment on State.
  • memory.rs:172-184: the MemDB doc could explicitly say "functionally equivalent to charon/core/dutydb.MemDB" — there's a comment in the file header but not on the type.
  • memory.rs:104-105: consider re‑exporting Result privately; many downstream callers will conflict with std::result::Result if they use pluto_core::dutydb::*.
  • memory.rs:530-552: the temporary binding let inner = &contrib.0; could go away if SyncContribution had a Deref or a data() accessor — minor.

Summary

  • One real concern: deadliner.add is awaited under the dutydb write lock (1a) — please move it outside or make it sync.
  • Behavior matches Go closely; the small parity surprises (unreachable agg clash check, missing per‑call cancellation, deadliner add on rejected duties) are inherited from Charon — flag them in the PR description so we don't pretend they were intentional Rust choices.
  • Add aggregator + clash tests to bring the suite in line with Go.

Nothing here blocks merge other than (1a) and the missing tests.

  • Read PR diff and changed files
  • Cross-reference Charon Go source (dutydb)
  • Functional correctness review (parity with Go)
  • Concurrency/locking review
  • Rust style / API surface review
  • Test quality review
  • Post final review
    • Branch: dutydb

@mskrzypkows mskrzypkows changed the title Dutydb feat: dutydb May 14, 2026
@varex83 varex83 linked an issue May 15, 2026 that may be closed by this pull request
Copy link
Copy Markdown
Collaborator

@varex83agent varex83agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test

Copy link
Copy Markdown
Collaborator

@varex83agent varex83agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t

Copy link
Copy Markdown
Collaborator

@varex83agent varex83agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t

Copy link
Copy Markdown
Collaborator

@varex83agent varex83agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t

Comment thread crates/core/src/dutydb/memory.rs Outdated
Copy link
Copy Markdown
Collaborator

@varex83agent varex83agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t

Comment thread crates/core/src/dutydb/memory.rs Outdated
impl MemDB {
/// Creates a new in-memory DutyDB.
pub fn new(deadliner: Arc<dyn Deadliner>, cancel: CancellationToken) -> Self {
let deadliner_rx = deadliner.c();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major — silent memory leak if the deadliner channel has already been consumed.

let deadliner_rx = deadliner.c();

Deadliner::c() documents itself as "may only be called once and returns None on subsequent calls" (deadline.rs:114-118). If a caller wires the deadliner into two components (or some intermediate already took the receiver), deadliner_rx is silently None and the drain loop at memory.rs:276-285 short-circuits via None => break on every store(). Expired duties never get cleaned up; the in-memory maps grow without bound for the lifetime of the process — no error, no log.

Recommend one of:

  • Make MemDB::new fallible: pub fn new(...) -> Result<Self> and return Err when c() returns None.
  • Take the receiver directly: pub fn new(deadliner, deadliner_rx, cancel) so the wiring is statically obvious.
  • At minimum, panic with a descriptive message.

Note: the test fixture NoopDeadliner (line 614-621) intentionally returns None, but it also never Adds anything to a queue, so the no-cleanup is safe for tests. Real wiring must yield Some(rx) exactly once.

Copy link
Copy Markdown
Collaborator

@varex83agent varex83agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t

root: phase0::Root,
}

struct State {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major — no upper bound on stored entries; resource-exhaustion vector.

State has eight maps (att_duties, att_pub_keys, att_keys_by_slot, pro_duties, agg_duties, agg_keys_by_slot, contrib_duties, contrib_keys_by_slot) whose only cleanup path is the deadliner drain at memory.rs:276-285. There is no cap on (a) how many slots can be tracked simultaneously, (b) how many committees/validators per slot, or (c) how far in the future a slot can be.

With u64 slot space, a peer who can inject duties into the consensus pipeline that feeds store() can submit duties for far-future slots whose deadlines (genesis_time + slot*slot_duration + ...) are far in the future, pinning entries for arbitrary durations. Combined with the deadliner mpsc full-channel-drops behaviour (see other finding), even near-term entries can leak.

Recommend:

  1. A configurable cap on entries per map, with Err at store() time when exceeded.
  2. A maximum permitted (slot - current_head_slot) delta enforced at store().
  3. Sanity bounds on AttesterDuty.committee_length and committees_at_slot (signeddata.rs:1124-1133) which are currently stored verbatim with no validation.

Copy link
Copy Markdown
Collaborator

@varex83agent varex83agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t

Comment thread crates/core/src/dutydb/memory.rs Outdated
}

// Drain all expired duties that the deadliner has sent.
loop {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major — deadliner full-channel drops cause permanent map leaks.

Cleanup of *_duties / *_keys_by_slot only runs inside this drain loop, consuming from deadliner_rx. The deadliner's output channel is a bounded mpsc of capacity 256 (deadline.rs:421-426). When full, deadline.rs:368-380 does a try_send whose Full arm logs a tracing::warn! and drops the expiry event. The deadliner also removes the duty from its internal HashSet (deadline.rs:383), so the expiry is permanently lost — never retried, never delivered.

Result: the matching entries in att_duties, att_pub_keys, att_keys_by_slot, pro_duties, agg_duties, agg_keys_by_slot, contrib_duties, contrib_keys_by_slot are never cleaned up. They persist until process restart.

Triggers: many duties expiring near-simultaneously across types (a peer who submits duties for hundreds of distinct slots × 4 duty types can produce > 256 expiries in a burst), or any store() call blocking long enough on the write lock for the channel to back up.

Fixes to consider in this PR:

  • Have MemDB run a dedicated background task that consumes from deadliner_rx independently of store(), so drain rate doesn't depend on call rate.
  • Replace try_send in the deadliner with a bounded blocking send (with a watchdog for slow consumers).

This affects the upstream deadliner too, so it may warrant a separate ticket; flag it here because the consequence shows up in DutyDB.

Copy link
Copy Markdown
Collaborator

@varex83agent varex83agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t

Comment thread crates/core/src/dutydb/memory.rs Outdated
Copy link
Copy Markdown
Collaborator

@varex83agent varex83agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t

},
None => break,
};
state.delete_duty(expired)?;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store() propagates delete_duty errors that are unrelated to the current store.

loop {
    let expired = match state.deadliner_rx {
        Some(ref mut rx) => match rx.try_recv() {
            Ok(d) => d,
            Err(_) => break,
        },
        None => break,
    };
    state.delete_duty(expired)?;
}

try_recv() removes the duty from the channel before delete_duty is called. If delete_duty returns Err(Error::DeprecatedDutyBuilderProposer) (line 563) or Err(Error::UnknownDutyType) (line 589), the loop terminates with that error propagated to the caller of store(). From the caller's perspective the current store failed — but storage already succeeded — and the expired duty has been silently dropped from the channel without its maps being cleaned up.

Go has the same shape (memory.go:130-147), so this is inherited; but the Rust port can harden it cheaply:

  1. After the successful storage above, do not bubble drain-loop errors to the store() caller — tracing::error! them instead, since the current store has nothing to do with the failure.
  2. The two error variants reachable here are both bugs (BuilderProposer is rejected at line 220 before any deadliner add; UnknownDutyType is unconstructable from Duty::new callers) — error!-and-continue is the right move.

Also minor: Err(_) collapses TryRecvError::Empty (normal break) and TryRecvError::Disconnected (deadliner task died, abnormal). Consider distinguishing them and logging Disconnected at error!.

Copy link
Copy Markdown
Collaborator

@varex83agent varex83agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t

return Err(Error::DeprecatedDutyBuilderProposer);
}

if !self.deadliner.add(duty.clone()).await {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deadliner.add() runs outside the state lock — race past deadline before insert.

if !self.deadliner.add(duty.clone()).await {
    return Err(Error::ExpiredDuty);
}
let mut state = self.state.write().await;

Go (memory.go:73-77) holds the mutex across deadliner.Add(duty) and the subsequent storage. Rust calls deadliner.add().await before acquiring the state write lock.

For the in-tree DeadlinerImpl and the test fakes this is benign (add is idempotent and concurrency-safe). The real risk: if add() succeeds and the .await between these two lines exceeds the duty's remaining lifetime (mainnet margin is slot_duration/12 ≈ 1s), the deadliner can schedule and fire an expiry before the data has been inserted. The drain loop will then no-op (HashMap::remove on absent keys is fine), then the subsequent insert under the write lock will store data that is already past its deadline, with no further cleanup until another store() covers that slot.

Not directly exploitable to cause memory growth (one duty leaked until the next call covering that slot), but it does widen the leak window from the deadliner-full finding. Either acquire the write lock first and call add() inside the critical section (matches Go), or document the intentional deviation in a source comment.

Copy link
Copy Markdown
Collaborator

@varex83agent varex83agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t

Comment thread crates/core/src/dutydb/memory.rs Outdated
Some(UnsignedDutyData::Proposal(p)) => state.store_proposal(p)?,
Some(_) => return Err(Error::InvalidVersionedProposal),
}
self.pro_notify.notify_waiters();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notify_waiters thundering herd — O(N) wakeups per store vs Go's O(M) per matching query.

store() calls self.X_notify.notify_waiters() on each successful arm (lines 240, 250, 260, 270). All readers blocked on the same notify type wake up, race for the read lock, re-run their lookup closure, and only the matching one returns; the rest re-register on a fresh notified() and re-sleep.

Go (memory.go:551-635 resolve*QueriesUnsafe) keeps one query record per pending await and only resolves queries whose key actually matches the newly stored item — O(M) per store where M is the number of pending queries.

For a validator running many concurrent attesters/aggregators this is a real amplification on every store(). Not a correctness bug, but a measurable efficiency regression vs the Go design. Long-term fix: keyed notify (HashMap<Key, Vec<oneshot::Sender<V>>>) or match Go's per-query record + resolver pattern. At minimum, note the design choice in a source comment so a future optimisation pass knows what to revisit.

@mskrzypkows mskrzypkows marked this pull request as ready for review May 15, 2026 15:13
@mskrzypkows
Copy link
Copy Markdown
Contributor Author

@varex83agent review again

Comment thread crates/core/src/dutydb/memory.rs
Comment thread crates/core/src/dutydb/memory.rs
if let Some(existing) = self.att_duties.get(&att_key) {
if existing.source != data.source
|| existing.target != data.target
|| existing.beacon_block_root != data.beacon_block_root
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing data.index and data.slot

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's missing because we create att_key from slot and index, so there is no reason to compare it.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the key is from committee_index, that's no guarantee on att.data.index == att.duty.committee_index right?

Comment thread crates/core/src/signeddata.rs
Comment thread crates/core/src/dutydb/memory.rs Outdated
Comment thread crates/core/src/dutydb/memory.rs Outdated
Comment thread crates/core/src/dutydb/memory.rs Outdated
Comment thread crates/core/src/dutydb/mod.rs Outdated
Copy link
Copy Markdown
Collaborator

@varex83agent varex83agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid first pass at porting core/dutydb/memory.go. The functional behaviour matches Charon for the happy paths and the test coverage hits most of the Go scenarios. Three findings push this to request-changes:

  1. store() holds the write lock across deadliner.add().await — a multi-await section under tokio RwLock that blocks all readers/writers behind a channel round-trip to the deadliner background task.
  2. Drop for MemDB cancels a caller-supplied CancellationToken — diverges from Go's Shutdown semantics and will propagate cancellation to every sibling subsystem that shares the token.
  3. store_agg_attestation unconditionally overwrites the stored VersionedAggregatedAttestation even though only the AttestationData root is in the key — fields like validator_index and the signed payload variant can differ between callers with the same root, and last-write-wins. Combined with the dead Error::ClashingDataRoot variant this is a behaviour change relative to Charon.

Minor parity gaps: await_agg_attestation drops the slot parameter from the Go API, await_* methods have no per-call cancellation (Go takes ctx.Context), AggKey drops the slot field. Several test assertions use fragile err.to_string().contains(...) instead of matches!.

Nothing here is a slashing-risk regression (the commIdx=0 compat path's source/target-only check is the slashing-safe choice). Verdict: REQUEST_CHANGES on the three majors; the rest are minor/nit comments left for author discretion.


let mut state = self.state.write().await;

if !self.deadliner.add(duty.clone()).await {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Write lock held across deadliner.add().await — readers and writers are starved on every store.

Line 258 takes self.state.write().await. Line 260 then awaits self.deadliner.add(duty.clone()).await. DeadlinerImpl::add (crates/core/src/deadline.rs:258-274) does an mpsc::Sender::send().await plus a oneshot::Receiver::await against a single background task — that's two awaits while the write lock is held. Under tokio's writer-preferring RwLock this blocks ALL concurrent await_proposal / await_attestation / pub_key_by_attestation / other store calls until the deadliner task drains its mpsc input and replies.

Charon's deadliner.Add (charon/core/dutydb/memory.go:76) is synchronous and non-blocking, so holding the equivalent db.mu.Lock() across it is cheap. The Rust port turns a fast critical section into a multi-hop async operation under the lock.

Fix: call the deadliner before acquiring the write lock —

if !self.deadliner.add(duty.clone()).await {
    return Err(Error::ExpiredDuty);
}
let mut state = self.state.write().await;

This preserves parity with Go (Add then mutate) and keeps the in-memory mutation in a fast non-awaiting critical section.


impl Drop for MemDB {
fn drop(&mut self) {
self.cancel.cancel();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop cancels a caller-supplied CancellationToken — propagates beyond the dutydb.

MemDB::new accepts cancel: CancellationToken from the caller (line 219). The whole point of CancellationToken is shareability — that token is almost always a clone of an app-wide shutdown token wired to the deadliner background task and potentially other subsystems (p2p, relay, …). Drop unconditionally calls self.cancel.cancel() on line 424.

If a MemDB is dropped before the explicit shutdown — construction error in a sibling, panic during init, replaced in a test fixture, error path that unwinds the stack — this Drop kills every component that shares this token, including the deadliner the MemDB references.

Charon's Shutdown only closes an internal chan struct{} (charon/core/dutydb/memory.go:67-69); dropping the DB cannot affect external components.

Pick one:

  1. Take the parent CancellationToken and store cancel.child_token() internally — Drop cancels only the child.
  2. Remove Drop and rely on the explicit shutdown() (whose doc already implies it is the lifecycle hook).
  3. Document the contract on new if option 2 is undesirable.

Option 1 is the most defensive and matches the pattern used elsewhere in the codebase.

Comment thread crates/core/src/dutydb/memory.rs Outdated
}
// we don't check existingDataRoot != providedDataRoot because these values
// comes from the same source and the error was unreachable
self.agg_duties.insert(key, agg.clone()); // unconditional overwrite
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unconditional overwrite + dead Error::ClashingDataRoot — Go-divergent for fields that aren't in the tree-hash key.

store_agg_attestation keys by att_data.tree_hash_root() (line 580/584) but the stored value is the full VersionedAggregatedAttestation, which carries validator_index and an AttestationPayload variant (Phase0/Altair/.../Fulu). Neither field is part of the AttestationData tree-hash root, so two callers with the same root but a different validator_index (or a different payload variant — e.g. one producer emits an Electra payload, another a Fulu) will silently overwrite each other; the last store_agg_attestation wins for every awaiter on that root.

The comment at lines 588-589 ("these values come from the same source and the error was unreachable") justifies removing the clash check, but the justification only holds for the root-only equality — not for the payload fields stored alongside.

Charon (charon/core/dutydb/memory.go:435-466) computes providedData.HashTreeRoot() and existingData.HashTreeRoot() and returns errors.New("clashing data root") if they differ. The Rust port has the analogous variant Error::ClashingDataRoot (memory.rs:91) but it is never constructed anywhere in the workspace — dead code.

Please either (a) restore an equality check against the full stored payload (signatures included) so two callers that disagree see an error, or (b) document why payload differences are guaranteed never to occur in Pluto's producer wiring AND remove Error::ClashingDataRoot to avoid a dead public variant.

Comment thread crates/core/src/dutydb/memory.rs Outdated
Comment thread crates/core/src/dutydb/memory.rs
};

/// Deadliner that always accepts duties and never expires them.
pub(crate) struct NoopDeadliner;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: pub(crate) on NoopDeadliner (line 681), TestDeadliner (line 697), and its methods (lines 704, 714) is redundant — they live inside a private mod tests, so nothing outside this file's test module can reach them anyway. Drop the modifiers.

Comment thread crates/core/src/dutydb/memory.rs Outdated
/// Stores unsigned duty data for the given duty, waking any pending
/// waiters.
pub async fn store(&self, duty: Duty, unsigned_set: UnsignedDataSet) -> Result<()> {
if duty.duty_type == DutyType::BuilderProposer {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: BuilderProposer short-circuits BEFORE deadliner.add (line 255-256), but Charon's Store hits case core.DutyBuilderProposer: return errDeprecated (charon/core/dutydb/memory.go:95-96) AFTER deadliner.Add(duty) on line 76. The duty type is deprecated so this rarely matters, but it is a documented order-of-operations divergence. Either move the check inside the match arm after the deadliner call (to match Go) or leave a comment that the pre-empt is intentional.

Comment thread crates/core/src/dutydb/memory.rs Outdated

/// Versioned aggregated attestation (unsigned).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct VersionedAggregatedAttestation(pub versioned::VersionedAttestation);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: VersionedAggregatedAttestation(pub versioned::VersionedAttestation) and SyncContribution(pub altair::SyncCommitteeContribution) (line 1160) both expose their inner field as pub, so callers reach in via .0 (e.g. contrib2.0.aggregation_bits = ... in the dutydb tests). Public fields on newtype wrappers defeat the wrapper's purpose. Either drop the wrappers and use the inner type with a pub type alias, or keep them and expose pub fn inner(&self) -> &T / into_inner(self) -> T with the field private.

Comment thread crates/core/src/dutydb/memory.rs Outdated
Comment thread crates/core/src/dutydb/memory.rs Outdated
Copy link
Copy Markdown
Collaborator

@iamquang95 iamquang95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other part lgtm


impl MemDB {
/// Creates a new in-memory DutyDB.
/// cancel: cancellation token that shuts down the DB, it should be a child
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we store cancel.child_token instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes seems good idea, it will guarantee that we won't cancel global one.

if let Some(existing) = self.att_duties.get(&att_key) {
if existing.source != data.source
|| existing.target != data.target
|| existing.beacon_block_root != data.beacon_block_root
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the key is from committee_index, that's no guarantee on att.data.index == att.duty.committee_index right?

Copy link
Copy Markdown
Collaborator

@varex83agent varex83agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Third-pass multi-agent review of #415. The PR already has ~60 comments from two prior /review-pr runs and three human reviewers, so this pass is filtered to novel findings only — no duplicates of issues already raised. All findings here are minor- or nit-level; the prior CHANGES_REQUESTED verdicts already cover the bug-level items.

Themes in this pass:

  1. Cancellation gap on the reader sidestate.read().await parks readers with no shutdown escape if a writer hangs (compounds with the already-flagged writer-side lock-across-await issue).
  2. Spurious notify_waiters cascades — empty unsigned_sets still fire wakeups and consume deadliner slots; tighten with an early-return.
  3. Error variant overloadingError::InvalidAggregatedAttestation and SignedDataError::Custom each conflate distinct failure modes, hurting operator triage and source-chain visibility.
  4. Silent overwrite of wrapper fields in store_agg_attestation (aggregation_bits, signature, validator_index, payload variant) — only the AttestationData tree-hash root is checked.
  5. Compat namespace asymmetry — the relaxed clash check in store_att_compat_commidx0 lets a compat-mirror entry poison the real committee_index=0 slot.
  6. Test fixture monoculture — every aggregator/proposer test uses Phase0 payloads, leaving post-merge match arms (Bellatrix/Deneb/Electra/Fulu) dead in CI.

Verdict: COMMENT (no new bug/major items above what's already been raised — leaving the prior CHANGES_REQUESTED standing).

tokio::pin!(notified);

{
let state = self.state.read().await;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reader-side shutdown DoS: state.read().await ignores self.cancel.

Both await_data (here) and pub_key_by_attestation (line 426) park unconditionally on self.state.read().await before entering any select! on self.cancel. If a writer holds the write lock for an extended period — most plausibly via the already-flagged deadliner.add().await pattern at lines 273-275 — every reader is stuck on the lock with no shutdown escape.

When shutdown() is called, cancel.cancel() wakes futures inside the tokio::select! on line 410, but not futures still parked on RwLock::read().await. tokio's RwLock is not cancellation-driven; it only wakes on lock release.

The shutdown contract documented at lines 259-260 ("signalling all current and future await_* calls to return Error::Shutdown") is violated whenever a writer hangs. The underlying writer-side fix (release the lock before any .await) is the right long-term solution, but defending the reader independently is cheap:

let state = tokio::select! {
    biased;
    _ = self.cancel.cancelled() => return Err(Error::Shutdown),
    s = self.state.read() => s,
};

Apply uniformly to await_data and pub_key_by_attestation. Defense-in-depth that survives any future writer-side regression.

Some(UnsignedDutyData::Proposal(p)) => state.store_proposal(p)?,
Some(_) => return Err(Error::InvalidVersionedProposal),
}
self.proposer_notify.notify_waiters();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notify_waiters fires even when nothing was inserted — wakes everyone for a no-op store.

When unsigned_set is empty for DutyType::Proposer, the match at line 284 takes the None arm and skips storage, but self.proposer_notify.notify_waiters() still runs here. All parked await_proposal futures wake, re-acquire the read lock, find nothing, re-park. The same shape applies to Attester/Aggregator/SyncContribution arms (lines 299, 309, 319) — empty unsigned_set means the for loop body never runs but the notify still fires.

Also: the empty-set branch silently consumes a deadliner slot at line 275 for a duty that contributes no data, which compounds the already-flagged deadliner-full leak.

Cheapest fix: early-return at the top of store() with if unsigned_set.is_empty() { return Ok(()); } before deadliner.add(). Or, less invasive, gate each notify on a bool changed tracked inside the arm. Either avoids cascading wake/re-park cycles on no-op stores (which happen in retry paths).

}

fn store_agg_attestation(&mut self, agg: &VersionedAggregatedAttestation) -> Result<()> {
let att_data = agg.data().ok_or(Error::InvalidAggregatedAttestation)?;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error::InvalidAggregatedAttestation overloads two unrelated failure modes.

Line 615 returns Error::InvalidAggregatedAttestation when agg.data() returns None — the inner versioned::VersionedAttestation.attestation field is absent (per signeddata.rs:174-176). But the same variant is returned at line 305 in store() for a different reason: the outer UnsignedDutyData variant was wrong.

Go's analogous failure mode (memory.go:413-415) covers only the wrapper-type mismatch — the eth2 client lib's VersionedAttestation has a recognised Version always, so missing-payload isn't an issue there. The Pluto wrapper allows attestation: None, which is a new failure mode without a Go analogue.

Operators reading logs or pattern-matching on Error can't tell the two cases apart. Split: Error::AggregatedAttestationMissingPayload for the line-615 case, or — if missing-payload is a valid "nothing to store" sentinel — early-return Ok(()) instead of erroring.

}
// we don't check existingDataRoot != providedDataRoot because these values
// come from the same source and the error was unreachable
self.aggregation_duties.insert(key, agg.clone()); // unconditional overwrite
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store_agg_attestation silently overwrites wrapper fields that aren't in the tree-hash root.

The comment at lines 627-628 ("values come from the same source") is correct only for AttestationData. The unconditional insert here replaces fields not covered by AttestationData::tree_hash_root():

  • aggregation_bits (different aggregators may include different bits)
  • signature (each aggregator produces its own)
  • validator_index in the outer versioned::VersionedAttestation
  • The AttestationPayload enum variant (Phase0 vs Altair vs Deneb vs Electra vs Fulu — same AttestationData root could be wrapped in different variants at a fork boundary)

Go (memory.go:462) also overwrites unconditionally, but Go's stored type is the eth2 client lib's wrapper which is largely stateless. A waiter blocked on a root could be resolved by an aggregator whose attestation has different aggregation_bits than what the caller expected.

Low-effort fix: add a doc comment on store_agg_attestation documenting that ONLY the AttestationData tree-hash root is checked. Better: tracing::warn! when overwrite changes wrapper fields (especially signature or aggregation_bits).


match duty.duty_type {
DutyType::Proposer => {
if unsigned_set.len() > 1 {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-flight validation runs after acquiring the write lock and after deadliner.add() — cheap latency win to hoist it.

Line 273 grabs self.state.write().await, line 275 awaits self.deadliner.add(...) (multi-hop async against a background task), and only then line 281 validates unsigned_set.len() > 1. The length check is a pure caller-side invariant — no state required. Same applies to:

  • the wildcard _ => return Err(Error::UnsupportedDutyType) at line 321 (rejects 9 reachable variants per the mem_db_store_unsupported test)
  • the BuilderProposer arm already pre-empted at lines 269-271

Go (memory.go:82-84) also performs the check inside the switch, but Go's deadliner.Add is synchronous; the cost is trivial there. In Rust the deadliner round-trip can block the write-lock holder for tokio-scheduler-noticeable time, and unsupported duty types pollute the deadliner queue (their later expiry hits delete_duty's _ => return Err(Error::UnknownDutyType) arm and contaminates the next successful store via the drain loop on line 326 — distinct from but compounded by the already-flagged delete_duty propagation finding).

Hoist all pre-flight validations (unsupported types, len > 1, BuilderProposer) above line 273 so invalid requests never grab the write lock or the deadliner.

Test gap: mem_db_store_unsupported (line 911-942) uses NoopDeadliner, which discards everything. A TestDeadliner + expire() test would reproduce the cross-store contamination.

slot,
committee_index: 0,
};
if let Some(existing) = self.attestation_duties.get(&att_key0) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store_att_compat_commidx0 shares attestation_duties's namespace with the relaxed clash check, letting a compat-mirror entry poison the real committee_index=0 slot.

store_att_data (line 521-535) rejects source/target/beacon_block_root mismatches. store_att_compat_commidx0 only compares source and target (by design, per the comment at lines 544-547). The asymmetry changes the meaning of clashes on the (slot, comm_idx=0) namespace because both paths write to the same map.

Concrete failure mode:

  1. Validator V at slot S attests with duty.committee_index=5, beacon_block_root=A. The compat path inserts AttKey{S, 0} → data_A with no clash.
  2. A second VC (or the same VC retrying with an updated head) then stores genuinely at committee_index=0 with beacon_block_root=B (same source/target — head shifted by one block during a late-slot store, common in practice).
  3. store_att_data(S, 0, data_B) now fails with Error::ClashingAttestationData because existing.beacon_block_root != data.beacon_block_root.

The compat entry has poisoned the legitimate comm_idx=0 slot. Go has the same shape (memory.go:347-403), but the Rust port can harden it cheaply: store compat entries in a separate map (e.g. att_compat_commidx0: HashMap<(slot, val_idx), AttestationData>) instead of sharing attestation_duties. await_attestation(slot, 0) then falls back to the compat map only on a miss in the real map. The shared-namespace design conflates 'real comm=0 attestation' with 'compat mirror' — they should be logically distinct.


// Unlike Go implementation, we key by root only, slot field is redundant.
let key = AggKey { root };
if !self.aggregation_duties.contains_key(&key) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: contains_key + unconditional insert triggers clippy::map_entry and double-hashes key.

The intent — gate only the secondary aggregation_keys_by_slot.push on absence while overwriting the primary map unconditionally — doesn't fit the standard Entry API. Rewrite to hash once and document the intent locally:

let first_seen = !self.aggregation_duties.contains_key(&key);
self.aggregation_duties.insert(key, agg.clone());
if first_seen {
    self.aggregation_keys_by_slot.entry(slot).or_default().push(key);
}

Same two map ops, but the variable name (first_seen) documents why the secondary push is conditional.

true
}

fn c(&self) -> Option<Receiver<Duty>> {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: NoopDeadliner::c() violates the documented Deadliner::c() contract ("may only be called once and returns None on subsequent calls", deadline.rs:114-118).

fn c(&self) -> Option<Receiver<Duty>> {
    let (_, rx) = channel(1);
    Some(rx)
}

This returns a fresh Some(rx) on every call (with the sender dropped immediately, so the receiver yields Disconnected straight away). A fake that lies about the contract masks real misuse — if MemDB::new is ever rewritten to assume c() returns None on the second call, the tests will keep passing while production breaks.

Fix: track the taken-state like TestDeadliner does (rx: Mutex<Option<Receiver<Duty>>> + .lock().unwrap().take()), or have a single shared (tx, rx) pair owned by the fake.

slot: u64,
committee_index: u64,
validator_index: u64,
) -> Result<PubKey> {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: PkKey field list appears twice in pub_key_by_attestation.

state
    .attestation_pub_keys
    .get(&PkKey { slot, committee_index, validator_index })
    .copied()
    .ok_or(Error::PubKeyNotFound { slot, committee_index, validator_index })

Bind once at the top — matches the let key = AttKey { ... } / let key = ContribKey { ... } shape already used in the other await_* methods (lines 346, 360, 377):

let key = PkKey { slot, committee_index, validator_index };
let state = self.state.read().await;
state
    .attestation_pub_keys
    .get(&key)
    .copied()
    .ok_or(Error::PubKeyNotFound { slot, committee_index, validator_index })

const COMM_IDX: u64 = 3;
const V_IDX: u64 = 7;

let agg = agg_attestation_fixture(SLOT, COMM_IDX, V_IDX);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Test fixtures are Phase0-only — Altair/Bellatrix/Capella/Deneb/Electra/Fulu payload arms are dead in CI.

agg_attestation_fixture (lines 1020-1042) hardcodes version: DataVersion::Phase0 and AttestationPayload::Phase0. The match arms in AttestationPayload::data() for the other six fork variants are never exercised, despite being on the hot path through store_agg_attestation (line 615) and ProposalBlock::root()/slot() (signeddata.rs:260-293).

Go's TestMemDBAggregator (memory_test.go:227-255) explicitly uses RandomDenebCoreVersionedAggregateAttestation, and TestMemDBProposer uses RandomBellatrixBeaconBlock. Add at least one agg_attestation_fixture_electra(...) and one bellatrix_proposal(...) fixture, and re-run mem_db_aggregator/mem_db_proposer/mem_db_clash_proposer against them. Particularly relevant because Electra/Fulu wrap a different concrete electra::Attestation type than the Phase0 family.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement core/dutydb

4 participants