Commit f32dd29

Add OPS Management contributor documentation (#164)
* Add OPS Management contributor documentation
  - Add contribute-ops-management.md covering the portal, onboarding, incidents, maintenance, severity levels with RFC8-aligned examples, status lifecycle, root cause codes, and permissions
  - Add Incident & Maintenance Logging section to contribute-operations.md linking to the new guide
  - Add OPS Management to mkdocs nav under Contributors
* Address PR review comments from Ben
1 parent 4b7c4e3 commit f32dd29

3 files changed

Lines changed: 256 additions & 0 deletions

docs/contribute-operations.md

Lines changed: 12 additions & 0 deletions
@@ -3,6 +3,17 @@

This guide covers the ongoing operational tasks for maintaining your DoubleZero Devices (DZDs), including agent upgrades, device/interface updates, and link management.

## Incident & Maintenance Logging

Any planned maintenance or unplanned link/device issue should be logged in the [OPS Management portal](contribute-ops-management.md). This gives all contributors visibility into what is happening across the network and avoids duplicate investigation.

- **Planned work** (e.g. replacing an optic, scheduled carrier maintenance): create a maintenance record before you start.
- **Unplanned issues** (e.g. link down, interface errors, packet loss): open an incident as soon as you begin investigating.

See the [OPS Management guide](contribute-ops-management.md) for onboarding steps and how to create tickets.

---

**Prerequisites**: Before using this guide, ensure you have:

- Completed the [Device Provisioning Guide](contribute-provisioning.md)
@@ -362,3 +373,4 @@ doublezero link update --pubkey <LINK_PUBKEY> --delay-override-ms 0

> ⚠️ **Note:**
> When a link is soft-drained, both `delay_ms` and `delay_override_ms` are overridden to 1000ms (1 second) to ensure deprioritization.

docs/contribute-ops-management.md

Lines changed: 243 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,243 @@
# OPS Management

The DoubleZero OPS Management portal is where contributors log and track incidents (unplanned outages) and maintenance (planned work) across the network. All tickets are visible to all contributors.

**Portal:** [https://doublezero.xyz/ops-management](https://doublezero.xyz/ops-management)

## Portal vs Slack

The OPS Management portal and Slack work together. All incidents and maintenance are tracked as tickets, accessible via the portal or the API. Each ticket notifies the right Slack channels automatically and gives every contributor a shared view of what is happening on the network. Slack is where the conversation happens: sharing logs, coordinating with other contributors, and collaborating on active issues.

Tickets are the canonical record, whether created via the portal or the API. Slack threads are not: they don't update ticket status and aren't stored permanently. Always keep the ticket status current, even if the conversation is happening in Slack.

The portal and Slack serve different purposes. Use both, but for the right things.

| Use the portal (or API) for... | Use Slack for... |
|-------------------------------|-----------------|
| Opening, updating, and closing tickets | Conversation and collaboration on an active issue |
| Recording status transitions | Sharing logs, screenshots, or starting a call |
| Assigning or escalating a ticket | Getting eyes on a problem quickly |
| Setting root cause on close | Coordinating with other contributors |

---

## Onboarding

Complete these steps once before using the portal.

### 1. Set Your Ops Manager Key

Register a Solana wallet pubkey as your Ops Manager key. Supported wallets: Phantom, Solflare, Coinbase Wallet.

```bash
doublezero contributor update \
  --ops-manager <OPS_MANAGER_PUBKEY> \
  --pubkey <CONTRIBUTOR_PUBKEY>
```

### 2. Connect Your Wallet on the Portal

1. Navigate to [https://doublezero.xyz/ops-management](https://doublezero.xyz/ops-management).
2. Click **Connect Your Wallet** and select your wallet.
3. Sign the message to prove ownership of your Ops Manager key.

Once you are authenticated, the **Incident Tracking Table** is displayed.

### 3. Create API Keys (Optional)

For programmatic access instead of the web form:

1. Click **Manage API Keys** on the portal.
2. Create one or more API keys.
3. Download the API documentation from this page.

---

## Incidents

An incident is an unplanned service-impacting event.

### Severity Levels

Assign severity based on the impact to the DoubleZero network. You can update severity as the situation evolves.

| Severity | Impact | Response |
|----------|--------|----------|
| `sev1` | Full outage or major control/data plane breakage with no fallback | Drop everything immediately, even outside working hours. Escalate to DoubleZero Foundation immediately. |
| `sev2` | Partial but substantial impact; degraded service with possible fallback | Treat as urgent. Coordinate actively. Overnight response required for sustained degradation. |
| `sev3` | Limited or no user-visible impact; potential to escalate if unresolved | Top priority during working hours. Monitor closely. No after-hours escalation required unless impact increases. |

??? note "Severity examples"

    **Sev1 examples**

    - More than 10% of user traffic blackholed on DoubleZero, no fallback to public internet
    - More than 80% of user onboarding, connect, or disconnect attempts failing
    - More than 20% of DZDs reporting interface errors
    - Controller returning valid but incorrect configs to DZD agents

    **Sev2 examples**

    - More than 20% of users unable to send/receive traffic over DoubleZero tunnels, but failing back to public internet
    - 0–10% of user traffic blackholed on DoubleZero without fallback
    - 20–80% of new user onboarding, connect, or disconnect attempts failing
    - More than 20% of config agents failing to apply DZD config
    - 0–20% of DZDs reporting interface errors
    - Upstream issues causing observability loss (monitoring/alerting down)
    - Onchain data pipeline down or producing incorrect data
    - More than 20% of internet latency collection or submission failing
    - Controller inaccessible by DZD agents
    - Controller returning invalid configs to DZDs that will not be applied

    **Sev3 examples**

    - 0–20% of users unable to send/receive traffic over DoubleZero tunnels, with fallback to public internet
    - 0–20% of DZDs reporting interface errors
    - 0–20% of DZDs experiencing config agent failures
    - 0–20% of user onboarding, connect, or disconnect attempts failing
    - More than 20% of internet latency collection or submission failing for a single data provider
    - 0–20% of internet latency collection or submission failing for all data providers
    - Bugs or tech debt causing alerting noise that cannot be silenced
    - DIA down or ledger RPC networking issues for 0–20% of devices for several hours
    - Low-impact issues such as minor bugs, cosmetic errors, or isolated incidents not affecting customer traffic
    - Small fraction of devices intermittently reporting errors without service disruption

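
The percentage bands in the severity examples can be read as a simple threshold lookup. The Python sketch below encodes only the onboarding/connect/disconnect failure-rate examples (0–20% as `sev3`, 20–80% as `sev2`, above 80% as `sev1`). The function name is hypothetical, and the handling of the exact boundary values is an assumption, since the examples do not say which band 20% or 80% belongs to.

```python
def severity_for_onboarding_failures(failure_pct: float) -> str:
    """Map an onboarding/connect/disconnect failure rate to a severity.

    Assumption: a rate of exactly 20% or 80% falls into the less severe
    band; the guide does not specify boundary behaviour.
    """
    if not 0 <= failure_pct <= 100:
        raise ValueError("failure_pct must be between 0 and 100")
    if failure_pct > 80:
        return "sev1"
    if failure_pct > 20:
        return "sev2"
    return "sev3"


for pct in (5, 45, 95):
    print(pct, severity_for_onboarding_failures(pct))
```

Treat the result as a starting point only: severity reflects overall network impact, and the guide allows it to be updated as the situation evolves.
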
### Opening an Incident

On the portal, click **Create New Record** and select Type = **Incident**, or submit via the API.

**Required:**

| Field | Description |
|-------|-------------|
| `title` | Short summary (max 100 characters) |
| `description` | Detailed explanation (max 500 characters) |
| `severity` | `sev1`, `sev2`, or `sev3` |
| `status` | Cannot be set to a terminal state (`resolved`, `closed`) on create |
| Device and/or Link | At least one required. On the web form, select from a dropdown of your device and link codes. When using the API, pass the corresponding pubkeys as `device_pubkey` and/or `affected_link_pubkey`. |

**Optional:**

| Field | Description |
|-------|-------------|
| `reporter_name` / `reporter_email` | Your contact details |
| `assignee` | Who is responsible for resolution |
| `internal_reference` | Your internal ticket ID (e.g. Jira, ServiceNow) |
| `start_at` | Defaults to creation time; editable |

Once created, a notification is posted to the contributor incidents Slack channel with the ticket ID, severity, affected devices/links, and contributor name.

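
If you submit incidents programmatically, it can help to check the required-field rules locally before sending anything. The following is a minimal Python sketch of those rules. It does not call the API: `validate_incident` and the dictionary layout are hypothetical, and the real request format is described in the API documentation you download from the portal.

```python
TERMINAL_STATUSES = {"resolved", "closed"}
SEVERITIES = {"sev1", "sev2", "sev3"}


def validate_incident(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload passes."""
    errors = []
    title = payload.get("title", "")
    description = payload.get("description", "")
    if not title or len(title) > 100:
        errors.append("title is required (max 100 characters)")
    if not description or len(description) > 500:
        errors.append("description is required (max 500 characters)")
    if payload.get("severity") not in SEVERITIES:
        errors.append("severity must be sev1, sev2, or sev3")
    status = payload.get("status")
    if not status:
        errors.append("status is required")
    elif status in TERMINAL_STATUSES:
        errors.append("status cannot be resolved or closed on create")
    # At least one of the two pubkey fields must be present.
    if not (payload.get("device_pubkey") or payload.get("affected_link_pubkey")):
        errors.append("device_pubkey and/or affected_link_pubkey is required")
    return errors


draft = {
    "title": "Link down",
    "description": "Interface errors followed by loss of light.",
    "severity": "sev2",
    "status": "open",
    "device_pubkey": "<DEVICE_PUBKEY>",  # placeholder, not a real key
}
print(validate_incident(draft))
```
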
### Updating an Incident

As the incident progresses, keep the ticket status current. This is the signal other contributors and DZ use to understand what's being worked on.

| Status | When to set it |
|--------|----------------|
| `open` | Initial state: issue reported, not yet being worked |
| `acknowledged` | You've seen it and taken ownership |
| `investigating` | Actively diagnosing: gathering logs, checking metrics |
| `mitigating` | Root cause known or suspected; applying a fix or workaround |
| `monitoring` | Fix applied; watching to confirm it holds |
| `resolved` | Issue confirmed fixed; **root cause required** |
| `closed` | Fully complete; no further action; **root cause required** |

```
open → acknowledged → investigating → mitigating → monitoring → resolved → closed
```

You can skip statuses if appropriate. For example, jump straight from `open` to `investigating` if you immediately start working it. Always use the most accurate status for the current state.

Each status update posts a reply in the original Slack notification thread.

### Closing an Incident

To move an incident to `resolved` or `closed`, a **root cause** must be set. You can set root cause at any earlier stage if you already know it; it becomes mandatory at close.

| Code | Description |
|------|-------------|
| `hardware` | Hardware repair, replacement, or upgrade (SFP, NIC, cable, device) |
| `software` | Software or firmware fix, update, or restart |
| `configuration` | Configuration change, fix, or rollback |
| `capacity` | Congestion, capacity limits, or traffic management |
| `carrier` | Circuit, wavelength, or cross-connect provider issue |
| `network_external` | External network issue outside contributor control |
| `facility` | Datacenter infrastructure issue (power, cooling) |
| `fiber_cut` | Physical fiber damage repaired |
| `security` | Security incident mitigated |
| `human_error` | Operational mistake corrected |
| `false_positive` | No actual issue found after investigation |
| `duplicate` | Already tracked in another ticket |
| `self_resolved` | Issue resolved without intervention |
| `dz_managed` | Issue with a DoubleZero-managed software component (activator, controller, etc.) |

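
The root cause rule can be sketched as a small pre-check on a status update: any listed status is acceptable, but `resolved` and `closed` need one of the codes above. This Python example is illustrative only. `check_status_update` is a hypothetical helper, and it deliberately does not enforce any ordering between statuses, since the guide allows statuses to be skipped.

```python
from typing import Optional

INCIDENT_STATUSES = [
    "open", "acknowledged", "investigating",
    "mitigating", "monitoring", "resolved", "closed",
]
ROOT_CAUSE_CODES = {
    "hardware", "software", "configuration", "capacity", "carrier",
    "network_external", "facility", "fiber_cut", "security", "human_error",
    "false_positive", "duplicate", "self_resolved", "dz_managed",
}


def check_status_update(new_status: str, root_cause: Optional[str] = None) -> None:
    """Raise ValueError if the update breaks the rules described above."""
    if new_status not in INCIDENT_STATUSES:
        raise ValueError(f"unknown status: {new_status}")
    if root_cause is not None and root_cause not in ROOT_CAUSE_CODES:
        raise ValueError(f"unknown root cause code: {root_cause}")
    # Root cause is optional earlier on, mandatory for terminal statuses.
    if new_status in ("resolved", "closed") and root_cause is None:
        raise ValueError(f"root cause is required to set status {new_status}")


check_status_update("investigating")        # fine, no root cause needed yet
check_status_update("resolved", "carrier")  # fine, root cause supplied
```
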
---

## Maintenance

A maintenance record describes a planned, time-bounded activity that may affect availability. Create it in advance so other contributors can see and avoid conflicting windows.

### Scheduling Maintenance

Click **Create New Record** > **Maintenance** on the portal, or submit via the API.

**Required:**

| Field | Description |
|-------|-------------|
| `title` | Short summary (max 100 characters) |
| `description` | Detailed explanation (max 500 characters) |
| `start_at` | Planned start time (UTC) |
| `end_at` | Planned end time (UTC); must be after `start_at` |
| Device and/or Link | At least one required. On the web form, select from a dropdown of your device and link codes. When using the API, pass the corresponding pubkeys as `device_pubkey` and/or `affected_link_pubkey`. |

Once created, a notification is posted to the contributor maintenance Slack channel with the ticket ID, affected devices/links, planned window, and contributor name.

### Managing Maintenance Status

Keep the status current as the window progresses.

| Status | When to set it |
|--------|----------------|
| `planned` | Scheduled, not yet started |
| `in-progress` | Work has begun |
| `completed` | Work finished successfully |
| `closed` | Auto-set 24 hours after `end_at` |
| `cancelled` | Called off before or during execution |

```
planned → in-progress → completed → closed (auto 24h after end_at)
   ↓           ↓
   └───────────┴──→ cancelled
```

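
The timing rules for a maintenance window (UTC times, `end_at` after `start_at`, automatic close 24 hours after `end_at`) can be worked through with Python's standard `datetime` module. This is a sketch under stated assumptions: `auto_close_time` is a hypothetical helper, and the portal performs its own validation.

```python
from datetime import datetime, timedelta, timezone


def auto_close_time(start_at: datetime, end_at: datetime) -> datetime:
    """Validate a maintenance window and return when it would auto-close."""
    if start_at.tzinfo is None or end_at.tzinfo is None:
        raise ValueError("use timezone-aware datetimes (times are in UTC)")
    if end_at <= start_at:
        raise ValueError("end_at must be after start_at")
    # Status moves to closed 24 hours after the planned end.
    return end_at.astimezone(timezone.utc) + timedelta(hours=24)


start = datetime(2025, 1, 10, 2, 0, tzinfo=timezone.utc)
end = datetime(2025, 1, 10, 4, 0, tzinfo=timezone.utc)
print(auto_close_time(start, end).isoformat())  # 2025-01-11T04:00:00+00:00
```
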
---

## Permissions and Escalation

### What Contributors Can Do

- Create and manage tickets for their own devices and links only.
- Assign tickets to themselves or escalate to DZ/Malbeclabs.
- View all tickets across all contributors.

### What DZ/Malbeclabs Admins Can Do

- Create tickets for any contributor's devices and links.
- Assign or reassign tickets between contributors.
- Handle escalations and support requests.

### DZX Link Ownership

DZX links connect devices from two different contributors. The **A-side** contributor (first device in the link name) owns the link and is the only one who can create tickets for it.

**Example:** For link `deviceA:deviceB`, the contributor who owns `deviceA` owns the link.

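
The ownership rule can be expressed as a short check. This Python sketch assumes link names always take the `A:Z` form shown in the example, with a single colon separating two device codes that contain no colons themselves. Both helper names are hypothetical.

```python
def a_side_device(link_name: str) -> str:
    """Return the A-side device code from a link name such as 'deviceA:deviceB'."""
    a_side, sep, z_side = link_name.partition(":")
    if not sep or not a_side or not z_side:
        raise ValueError(f"expected '<A-side>:<Z-side>', got {link_name!r}")
    return a_side


def can_create_ticket(link_name: str, my_devices: set) -> bool:
    """Only the contributor who owns the A-side device may create tickets."""
    return a_side_device(link_name) in my_devices


print(can_create_ticket("deviceA:deviceB", {"deviceA"}))  # True
print(can_create_ticket("deviceA:deviceB", {"deviceB"}))  # False
```
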
**If the issue is on the Z-side:**

1. A-side contributor creates a ticket for the DZX link.
2. Assign the ticket to DZ/Malbeclabs.
3. DZ/Malbeclabs investigates and reassigns to the Z-side contributor if needed.

We recognise this workflow is limited. Z-side contributors currently cannot create tickets for DZX links they don't own, which means coordination has to go through DZ/Malbeclabs. We are working to improve this so that both sides of a DZX link can declare incidents and maintenance independently.

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -68,6 +68,7 @@ nav:
- Requirements & Architecture: contribute.md
- Device Provisioning: contribute-provisioning.md
- Operations: contribute-operations.md
- OPS Management: contribute-ops-management.md
- Architecture: architecture.md
- Glossary: glossary.md
markdown_extensions:
