
Commit c43c227

rfc17: update based on PR review comments
- last_bgp_reported_at updated on every write (not only on transitions)
- telemetry agent writes on state change or after periodic refresh interval (~1h)
- removed user.status == Activated validation constraint
- instruction variant changed from 94 to TBD (94-103 are taken)
- expanded UserBGPSession alternative with rejection rationale
- removed resolved open question about periodic reconfirmation writes
1 parent c749cbf commit c43c227

1 file changed

Lines changed: 24 additions & 23 deletions


rfcs/rfc17-user-bgp-status.md

@@ -33,8 +33,11 @@ connection diagnostics.

 - **Store status in S3/ClickHouse only** — already done for raw socket stats. Not queryable
 onchain and not accessible to other onchain programs.
-- **Separate PDA account per user** — avoids resizing the User account but adds a new account
-type and complicates reads. Rejected for simplicity.
+- **Separate `UserBGPSession` PDA per user** — isolates BGP state from the User account and
+avoids resizing it. Rejected because BGP status has a strict 1:1 relationship with the user,
+the Device account must always be read to verify write authority regardless, and splitting the
+data would require reading two accounts for every consumer that queries a user's connection
+state.

 ## Detailed Design

@@ -52,21 +55,21 @@ Add three fields to the end of the `User` struct:
 transitioned to `Up`. Zero means the session has never been observed Up.

 3. `last_bgp_reported_at: u64` (8 bytes, DZ ledger slot) — the last slot when the
-telemetry agent successfully wrote a BGP status change for this user. Updated only
-when `bgp_status` transitions to a different value. Consumers can use this field to
-detect agent silence: if `last_bgp_reported_at` is older than a threshold, the
-`bgp_status` value should be treated as stale rather than authoritative, avoiding
-false `Up` readings when the agent has stopped reporting.
+telemetry agent successfully wrote a BGP status update for this user. Updated on
+every `SetUserBGPStatus` write, whether or not `bgp_status` changed. Consumers can
+use this field to detect agent silence: if `last_bgp_reported_at` is older than a
+threshold, the `bgp_status` value should be treated as stale rather than
+authoritative, avoiding false `Up` readings when the agent has stopped reporting.

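The consumer-side staleness check described for `last_bgp_reported_at` could look roughly like the sketch below. The `BGPStatus` numeric values, the `UserBGPView` type, the `effective_status` helper, and the threshold are all assumptions for illustration; the RFC excerpt only fixes the field names and that the timestamps are DZ ledger slots.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class BGPStatus(IntEnum):
    # Hypothetical encoding; the excerpt does not define the numeric values.
    UP = 1
    DOWN = 2


@dataclass(frozen=True)
class UserBGPView:
    """Hypothetical read-side view of the three new User fields."""
    bgp_status: BGPStatus
    last_bgp_up_at: int        # slot; 0 means never observed Up
    last_bgp_reported_at: int  # slot of the latest SetUserBGPStatus write


def effective_status(user: UserBGPView, current_slot: int,
                     max_age_slots: int) -> Optional[BGPStatus]:
    """Return bgp_status, or None when the last report is older than the threshold."""
    age = current_slot - user.last_bgp_reported_at
    if age > max_age_slots:
        return None  # agent silence: treat the stored status as stale
    return user.bgp_status
```

Returning `None` rather than the stored value is what prevents a false `Up` reading when the agent has stopped reporting; the right `max_age_slots` would depend on the agent's refresh interval.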
 The `SetUserBGPStatus` instruction reallocates the account by 17 bytes on first write
 (1 + 8 + 8), with the metrics publisher covering any additional rent. `last_bgp_up_at`
-and `last_bgp_reported_at` are both updated only when the status value changes.
+is updated only when the status transitions to `Up`.

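To make the 17-byte arithmetic and the update rules concrete, here is a minimal sketch of the revised write semantics. It assumes a packed little-endian layout (`u8` + `u64` + `u64`), hypothetical numeric values for `Up` and `Down`, and a hypothetical `apply_write` helper; it does not model the reallocation or rent transfer.

```python
import struct

# Assumed packed little-endian layout: bgp_status (u8),
# last_bgp_up_at (u64), last_bgp_reported_at (u64).
TRAILER = struct.Struct("<BQQ")
assert TRAILER.size == 17  # 1 + 8 + 8

UP, DOWN = 1, 2  # hypothetical status encoding


def apply_write(trailer: bytes, new_status: int, slot: int) -> bytes:
    """Apply one SetUserBGPStatus write to the three trailing fields."""
    status, up_at, _reported_at = TRAILER.unpack(trailer)
    # last_bgp_up_at moves only when the session transitions to Up.
    if new_status == UP and status != UP:
        up_at = slot
    # last_bgp_reported_at moves on every write, changed or not.
    return TRAILER.pack(new_status, up_at, slot)
```

For example, writing `Up` at slot 10 and again at slot 20 leaves `last_bgp_up_at` at 10 while `last_bgp_reported_at` advances to 20.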
-### New instruction: SetUserBGPStatus (variant 94)
+### New instruction: SetUserBGPStatus (variant TBD)

 Accounts: user (writable), device (readonly), metrics_publisher (signer + writable).

-Validation: signer == device.metrics_publisher_pk, user.device_pk == device, user.status == Activated.
+Validation: signer == device.metrics_publisher_pk, user.device_pk == device.

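The two remaining validation checks can be sketched as below. Public keys are modeled as plain strings, and the `Device`, `User`, `ValidationError`, and `validate_set_user_bgp_status` names are hypothetical stand-ins rather than the program's actual types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    pubkey: str                # address of the Device account
    metrics_publisher_pk: str  # key authorized to publish for this device


@dataclass(frozen=True)
class User:
    device_pk: str  # device this user is attached to


class ValidationError(Exception):
    """Raised when a SetUserBGPStatus precondition fails."""


def validate_set_user_bgp_status(signer: str, user: User, device: Device) -> None:
    # signer == device.metrics_publisher_pk
    if signer != device.metrics_publisher_pk:
        raise ValidationError("signer is not the device's metrics publisher")
    # user.device_pk == device
    if user.device_pk != device.pubkey:
        raise ValidationError("user does not belong to this device")
    # No check on user.status: the Activated constraint was removed.
```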
 ### Telemetry collector

@@ -75,18 +78,21 @@ After each BGP socket collection tick in `collectBGPStateSnapshot`:
 1. Fetch activated users for this device from the serviceability program.
 2. Map each user to its BGP peer IP: `overlay_dst_ip = user.TunnelNet[0:4]`, last octet +1.
 3. For each user: Up if a socket with matching RemoteIP exists, Down otherwise.
-4. Enqueue one `SetUserBGPStatus` transaction per user into a non-blocking background
-worker. The worker retries failed submissions independently so that a single RPC
-error or congested transaction does not delay other users or block the collection
-tick. The metrics publisher keypair is already loaded in the telemetry agent.
+4. For each user, submit `SetUserBGPStatus` if: (a) the computed status differs from
+the last known onchain value, or (b) the last write was more than a configurable
+interval ago (e.g., 1h), to keep `last_bgp_reported_at` fresh for staleness
+detection. Submissions are enqueued into a non-blocking background worker that
+retries independently so that a single RPC error does not delay other users or
+block the collection tick. The metrics publisher keypair is already loaded in the
+telemetry agent.

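Steps 2 to 4 of the revised collector logic can be sketched as follows. This is an illustration under assumptions: `TunnelNet` is treated as raw bytes whose first four are the IPv4 network address, the agent keeps a local cache of its last write per user, and `peer_ip`, `should_submit`, and `plan_writes` are hypothetical names. The background worker and retry handling are left out.

```python
import ipaddress


def peer_ip(tunnel_net: bytes) -> str:
    """Step 2: network address from TunnelNet[0:4], last octet + 1."""
    octets = bytearray(tunnel_net[0:4])
    octets[3] += 1  # assumes the last octet is below 255
    return str(ipaddress.IPv4Address(bytes(octets)))


def should_submit(computed, last_status, last_write_time, now, refresh_interval):
    """Step 4: write on a status change, or when the refresh interval has elapsed."""
    if last_status is None or computed != last_status:
        return True
    return now - last_write_time > refresh_interval


def plan_writes(users, socket_remote_ips, cache, now, refresh_interval=3600.0):
    """Return (user_pk, status) pairs to enqueue for this collection tick.

    users: user_pk -> TunnelNet bytes
    socket_remote_ips: set of RemoteIP strings seen in the BGP socket snapshot
    cache: user_pk -> (last written status, time of last write in seconds)
    """
    writes = []
    for user_pk, tunnel_net in users.items():
        # Step 3: Up if a socket with a matching RemoteIP exists, Down otherwise.
        status = "Up" if peer_ip(tunnel_net) in socket_remote_ips else "Down"
        last_status, last_time = cache.get(user_pk, (None, 0.0))
        if should_submit(status, last_status, last_time, now, refresh_interval):
            writes.append((user_pk, status))
    return writes
```

With a stable session and a one-hour interval, a user produces no transaction on most ticks and one refresh write per hour, which is what bounds the load at "up to N transactions per tick".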
 The raw TCP snapshot upload to S3 continues unchanged.

 ## Impact

 - Serviceability program: one new instruction, seventeen new bytes on User accounts (1 byte `bgp_status` + 8 bytes `last_bgp_up_at` + 8 bytes `last_bgp_reported_at`).
-- Telemetry agent: one extra RPC call per collection tick to fetch users; N transactions
-per tick (one per activated user on the device).
+- Telemetry agent: one extra RPC call per collection tick to fetch users; up to N transactions
+per tick (one per user whose status changed, or whose periodic refresh interval has elapsed).
 - Read SDKs (Go, TypeScript, Python): update User deserialization for the new field.

 ## Security Considerations
@@ -131,15 +137,10 @@ On a devnet device with at least one activated user:
 - Should there be a grace period before marking a session `Down`? A single missed tick
 due to a transient collection error would incorrectly transition an active user to
 `Down`. One option is to require N consecutive `Down` observations before writing.
-- Since the agent only writes on status changes, `last_bgp_reported_at` will not
-advance for stable sessions, making it impossible to distinguish a healthy long-lived
-`Up` session from a silent agent. Should the agent periodically send a reconfirmation
-write (e.g., every N days) even when the status has not changed, to keep
-`last_bgp_reported_at` fresh and preserve staleness detection?
 - Should we implement per-user rate limiting to prevent RPC saturation caused by
 constant BGP flaps? A user cycling Up/Down rapidly would generate a transaction on
-every tick; a cooldown window or minimum time-between-writes per user account could
-bound the worst-case submission rate.
+every state-change; a cooldown window or minimum time-between-writes per user account
+could bound the worst-case submission rate.
 - How should recurring circuit flaps be handled? A user whose BGP session repeatedly
 drops and recovers within short windows may indicate an unstable circuit rather than
 a transient error. Should the data model track a flap counter or a flap rate to
