tzel-operator: --dal-slot-range CLI flag for slot selection #28

Open
saroupille wants to merge 1 commit into trilitech:main from saroupille:feat/dal-slot-range
@saroupille
Collaborator

Why

tzel-operator round-robins DAL slot publication across all slots reported by the protocol (0..number_of_slots returned by /protocol_parameters). There's currently no way to partition slot usage between operators sharing a DAL node — every operator competes for every slot, and the only collision-handling lives in the inner retry loop (already proposed → next slot).

This lands the smallest CLI knob that enables external coordination: each operator picks a disjoint slot subrange.

What

New flag: --dal-slot-range START..END (half-open, Rust idiom — 0..8 means slots 0..=7). Default: unset → behaviour unchanged.

# Operator A — first half
tzel-operator --dal-slot-range 0..16 ...

# Operator B — second half
tzel-operator --dal-slot-range 16..32 ...

How

  • New CLI arg dal_slot_range: Option<Range<u16>> with a custom value_parser (parse_dal_slot_range) that splits on .. and rejects empty/inverted/malformed input at parse-time.
  • New field dal_slot_range: Option<Range<u16>> on OperatorConfig.
  • select_slot_index now computes base + (counter % span) where (base, span) derives from the configured range; falls back to (0, number_of_slots) when unset.
  • publish_dal_chunk_with_protocol now bounds its retry loop on the range width (dal_slot_attempt_budget) instead of number_of_slots, and calls validate_dal_slot_range per publish — fail-fast with a clear message if the configured range exceeds the protocol's number_of_slots.
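
As a rough illustration of the first bullet, a parser of this shape would reject the cases listed. This is a minimal sketch, not the code in the patch: the signature, the `String` error type, and every message except the "strictly less than" one quoted in the test plan are assumptions.

```rust
use std::ops::Range;

/// Hypothetical sketch of a `START..END` (half-open) parser.
/// The real `parse_dal_slot_range` may differ in signature and wording.
fn parse_dal_slot_range(input: &str) -> Result<Range<u16>, String> {
    // Split on the first `..`; a bare number such as "5" has no separator.
    let (start_raw, end_raw) = input
        .split_once("..")
        .ok_or_else(|| format!("--dal-slot-range expects START..END, got '{input}'"))?;

    // Parsing as u16 rejects non-numeric bounds and values above 65535.
    let start: u16 = start_raw
        .trim()
        .parse()
        .map_err(|e| format!("--dal-slot-range invalid start '{start_raw}': {e}"))?;
    let end: u16 = end_raw
        .trim()
        .parse()
        .map_err(|e| format!("--dal-slot-range invalid end '{end_raw}': {e}"))?;

    // Empty (start == end) and inverted (start > end) ranges are both rejected.
    if start >= end {
        return Err(format!(
            "--dal-slot-range start ({start}) must be strictly less than end ({end})"
        ));
    }
    Ok(start..end)
}
```

A function with this shape (`fn(&str) -> Result<T, String>`) is the kind clap accepts as a `value_parser`, which is what lets the rejection happen at parse time rather than at first publish.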

select_slot_index short-circuits on span == 0 before the AtomicU64 fetch_add, so a degenerate config (a defensive-only path, unreachable via the CLI parser) does not waste counter ticks.
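
The guard ordering can be sketched as follows. The signature, the `Option<u16>` return type, and the `slot_window` helper are assumptions made for illustration; only the `base + (counter % span)` formula, the `(0, number_of_slots)` fallback, and the guard-before-`fetch_add` order come from the description above.

```rust
use std::ops::Range;
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical helper: derive (base, span) from the optional range,
/// falling back to the full protocol pool when no range is configured.
fn slot_window(range: Option<&Range<u16>>, number_of_slots: u16) -> (u16, u16) {
    match range {
        Some(r) => (r.start, r.end.saturating_sub(r.start)),
        None => (0, number_of_slots),
    }
}

/// Hypothetical signature: returns None when the window is empty.
fn select_slot_index(
    counter: &AtomicU64,
    range: Option<&Range<u16>>,
    number_of_slots: u16,
) -> Option<u16> {
    let (base, span) = slot_window(range, number_of_slots);
    // Guard first: an empty window must not consume a counter tick.
    if span == 0 {
        return None;
    }
    let tick = counter.fetch_add(1, Ordering::Relaxed);
    // tick % span is below span, so the cast is lossless and base + offset < end.
    Some(base + (tick % u64::from(span)) as u16)
}
```

With no range configured the expression reduces to `counter % number_of_slots`, which is the pre-patch behaviour described under Compatibility.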

Test plan

  • cargo build -p tzel-services --bin tzel-operator — green (verified locally on top of origin/main).
  • cargo test -p tzel-services --bin tzel-operator parse_dal_slot_range — 2/2 green.
  • Parser smoke test: 11 inputs covering the well-formed cases (0..8, 3..7, 1..4 with whitespace, 65534..65535 near-max) and the seven rejection paths (0..0 empty, 8..3 inverted, 5 missing separator, 0..bar non-numeric end, foo..5 non-numeric start, 0..65536 u16 overflow, "" empty):
    $ tzel-operator --dal-slot-range 8..3 ...
    error: invalid value '8..3' for '--dal-slot-range <DAL_SLOT_RANGE>':
      --dal-slot-range start (8) must be strictly less than end (3)
    
  • Adversarial pre-push review — one independent reviewer pass flagged 6 items; 3 applied (counter-before-guard reorder, error-message flag-name prefix, unit tests), 3 triaged out (per-publish protocol RPC was preexisting behaviour, not introduced; 0..1 collision-on-first-failure is intentional contract; concurrent-counter slot-skipping preexisted the patch and the commit message is now factual about it).
  • Runtime smoke in a sandbox is left to the integrator — the CLI/parse paths are exhaustively tested but the round-robin + retry behaviour with Some(range) set has not been observed against a live DAL node.

Compatibility

Default behaviour (no flag) is bit-for-bit unchanged: select_slot_index reduces to counter % number_of_slots, identical to the pre-patch implementation.

🤖 Generated with Claude Code

Previously the operator round-robined across all DAL slots reported by
the protocol (0..number_of_slots from the /protocol_parameters endpoint),
with no way to partition slot usage between multiple operators sharing
a DAL node.

Add `--dal-slot-range START..END` (half-open, Rust idiom) to restrict
the slot pool. Round-robin is preserved within the configured range.

Validation:
- Parse-time: clap value_parser rejects malformed input, empty ranges
  (start == end) and inverted ranges (start > end).
- Runtime: on each publish we re-validate `range.end <= number_of_slots`
  against the live DAL protocol parameters — fail fast with a clear
  message if the operator was configured outside the protocol-allowed
  band, or if the protocol shrinks number_of_slots out from under us.
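
A minimal sketch of the runtime check, assuming a `Result<(), String>` return type and an illustrative error message (neither is taken from the patch):

```rust
use std::ops::Range;

/// Hypothetical sketch: the configured range must fit inside the slots
/// the protocol currently reports. `end == number_of_slots` is allowed
/// because the range is half-open.
fn validate_dal_slot_range(
    range: Option<&Range<u16>>,
    number_of_slots: u16,
) -> Result<(), String> {
    match range {
        Some(r) if r.end > number_of_slots => Err(format!(
            "--dal-slot-range {}..{} exceeds the protocol's number_of_slots ({})",
            r.start, r.end, number_of_slots
        )),
        // No range configured, or the range fits: nothing to reject.
        _ => Ok(()),
    }
}
```

Because the check runs on each publish, a range that was valid at startup (say 16..32 with 32 slots) starts failing as soon as the protocol reports fewer slots.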

The retry loop in `publish_dal_chunk_with_protocol` now bounds attempts
on the configured range width rather than `number_of_slots`. The
shared AtomicU64 means that under concurrent publishes a single chunk
attempt may revisit a slot or skip one before exhausting the range —
no behavioural change relative to the pre-patch shared-counter scheme,
but worth noting.
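
The bounded retry can be sketched like this. `publish_with_budget` and its closure parameter are hypothetical stand-ins for the real publish path, and `dal_slot_attempt_budget` is assumed here to be a function returning the range width; the patch may structure this differently.

```rust
use std::ops::Range;
use std::sync::atomic::{AtomicU64, Ordering};

/// Assumed shape: the number of attempts equals the width of the slot pool.
fn dal_slot_attempt_budget(range: Option<&Range<u16>>, number_of_slots: u16) -> u16 {
    match range {
        Some(r) => r.end.saturating_sub(r.start),
        None => number_of_slots,
    }
}

/// Hypothetical stand-in for the retry loop: advance the shared counter
/// until `try_publish` accepts a slot or the budget is spent.
fn publish_with_budget<F>(
    counter: &AtomicU64,
    range: Option<&Range<u16>>,
    number_of_slots: u16,
    mut try_publish: F,
) -> Option<u16>
where
    F: FnMut(u16) -> bool,
{
    let base = range.map_or(0, |r| r.start);
    let budget = dal_slot_attempt_budget(range, number_of_slots);
    // A zero budget skips the loop entirely, so the modulo never divides by zero.
    for _ in 0..budget {
        let tick = counter.fetch_add(1, Ordering::Relaxed);
        let slot = base + (tick % u64::from(budget)) as u16;
        if try_publish(slot) {
            return Some(slot);
        }
    }
    None
}
```

If another publisher bumps the counter between two attempts, the same slot can come up twice and a neighbour can be skipped, which is the caveat described above: with range 0..2 and one foreign tick per attempt, both attempts land on slot 0.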

`select_slot_index` short-circuits on `span == 0` BEFORE bumping the
counter, so a degenerate config (unreachable via the CLI but defensible
at the API surface) does not burn counter ticks and break fairness for
concurrent callers.

Tests: 2 table-driven tests on `parse_dal_slot_range` covering well-
formed input (whitespace, max-u16-end) and the seven rejection paths
(empty range, inverted, missing separator, non-numeric start/end,
overflowing u16, empty string).

Default behaviour (no flag) is unchanged: full 0..number_of_slots
round-robin.