Harden networking: auth, memory-safety, DoS, and thread-safety fixes#101
Conversation
Addresses the confirmed Critical/High issues (plus two low-risk Mediums) from an adversarial review of the networking stack. Each fix ships with a regression test; green on `zig build test`, Debug, ReleaseFast, ReleaseSafe. Authentication / spoofing - Verify the Ed25519 signature on org control messages (revoke/alias/vouch) and require a trusted org before acting on them (swim.zig, codec.zig). - Enforce RX cryptokey routing: a decrypted packet's inner source IP must belong to the sending peer, for IPv4 and IPv6 (device.zig). - Gossiped dead/leave/suspect about a third party now only marks it suspected locally and is confirmed by our own failure detector before any eviction; gossip can no longer clear local suspicion or resurrect a peer (swim.zig). - DNS bootstrap: random per-query transaction ID, connected socket for source filtering, and transaction-ID/question validation (dns.zig). Memory safety (safety checks are off in the shipped ReleaseFast build) - Bound the base64 token decode against the output buffer before decoding, fixing a stack out-of-bounds write on an over-long mg:// token (coordinated_punch.zig). - Return expired-peer keys via a caller-owned buffer instead of a slice into a reclaimed stack frame (membership.zig). Resource exhaustion / DoS - Cap the membership table and reclaim non-alive entries so unauthenticated gossip cannot grow it without bound (membership.zig). - Rate-limit inbound WireGuard handshake initiations (per-source + global token bucket) before the X25519 (device.zig). Thread safety - Add an RwLock to the WgDevice peer table (exclusive for mutation/handshake, shared for the data plane) so peer removal cannot zero key material while a worker is mid-encrypt (device.zig, main.zig). - Add an RwLock to the membership table; the single writer takes it exclusively for mutations, external readers (data plane, FFI) share it (membership.zig, main.zig, meshguard_ffi.zig). Service policy - Classify IPv6 packets in the service filter and fail closed on anything unclassifiable under a default-deny policy (policy.zig).
The H4 rate-limit change added a src_ip parameter to WgDevice.handleInitiation; wg_interop.zig (built only on Linux) still used the old 1-arg call, breaking the ubuntu CI build. macOS doesn't compile this binary, so local builds missed it.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 70e3460cb8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (!innerSourceAllowed(peer.mesh_ip, peer.mesh_ip6, out[0..len])) { | ||
| return error.SourceSpoofed; |
There was a problem hiding this comment.
Allow framed tunnel payloads through decryptTransport
When the FFI tunnel API is used, meshguard_tunnel_send deliberately frames arbitrary payloads with a leading 0x00 header before encrypting them, and the receiver routes those packets through handleWgPacket → dev.decryptTransport. This new unconditional inner-IP check makes decryptTransport return SourceSpoofed for any non-IPv4/IPv6 plaintext, so every framed tunnel message is dropped after decryption rather than being delivered to the tunnel inbox. The source-IP enforcement needs to be limited to TUN IP packets or provide a separate decrypt path for the framed tunnel transport.
Useful? React with 👍 / 👎.
| if (!self.hs_limiter.allow(src_ip, nowNs())) { | ||
| return error.HandshakeRateLimited; |
There was a problem hiding this comment.
Avoid charging auto-registered handshakes twice
In the FFI responder path, the first packet from a membership-known peer that is not yet in WgDevice is handled by calling handleInitiation, catching UnknownPeer, auto-registering, and then immediately retrying handleInitiation. Since the limiter token is consumed here before the UnknownPeer lookup, that single legitimate first handshake burns one token on the failed probe and another on the retry; if the source has only one token left, the retry is rate-limited and no response is sent. Consider not consuming the bucket for UnknownPeer probes or making the auto-register retry reuse the admission decision.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
Hardening pass across Meshguard’s networking stack to close multiple security and reliability gaps in the WireGuard/userspace data plane, SWIM membership handling, DNS bootstrap, and service-policy enforcement, with accompanying regression tests and documentation updates.
Changes:
- Add WireGuard peer-table/thread-safety protections (device
RwLock) plus a handshake-init token-bucket limiter and RX inner-source enforcement. - Harden discovery/membership against spoofing and resource exhaustion (org control signature verification, gossip “dead” handling, membership cap + reclaim, safer
expireSuspectedAPI). - Fix service policy IPv6 bypass and DNS bootstrap response spoofing weaknesses (random TXID, connected UDP, response validation).
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| src/wireguard/device.zig | Adds handshake rate limiter, peer-table RwLock, and RX inner-source enforcement plus tests. |
| src/services/policy.zig | Adds IPv6-capable packet classification and a fail-closed allowPacket API plus tests. |
| src/protocol/codec.zig | Adds canonical signed-payload builders for org control message verification. |
| src/net/dns.zig | Randomizes DNS TXID, connects UDP socket, validates response matches query, adds tests. |
| src/nat/coordinated_punch.zig | Bounds base64 decode output to prevent OOB writes; adds regression tests. |
| src/meshguard_ffi.zig | Adds membership read locks for several exported FFI entry points; updates WG initiation call signature. |
| src/main.zig | Uses new service-filter API and adds WG device shared locks in some dataplane paths. |
| src/discovery/swim.zig | Verifies org control signatures and changes gossip “dead/leave/suspect” handling; updates expire API usage; adds tests. |
| src/discovery/membership.zig | Adds membership cap + reclaim, introduces RwLock, fixes expireSuspected use-after-return, adds tests. |
| SECURITY.md | Documents the new security properties (RX routing, signed org control, confirmed failure detection, rate limiting). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Per-source bucket (hash → slot, reset on collision with a different IP). | ||
| const h = (((@as(usize, ip[0]) *% 131 +% ip[1]) *% 131 +% ip[2]) *% 131 +% ip[3]); | ||
| const b = &self.buckets[h % SLOTS]; |
| /// Handle an incoming handshake initiation (we are responder). | ||
| /// O(1) lookup via static key table instead of O(N) peer iteration. | ||
| pub fn handleInitiation(self: *WgDevice, msg: *const noise.HandshakeInitiation) !struct { response: noise.HandshakeResponse, slot: usize } { | ||
| // Step 0: Verify MAC1 BEFORE expensive DH (anti-DoS gate) | ||
| pub fn handleInitiation(self: *WgDevice, msg: *const noise.HandshakeInitiation, src_ip: [4]u8) !struct { response: noise.HandshakeResponse, slot: usize } { | ||
| // Step 0: Verify MAC1 BEFORE expensive DH (cheap, no lock). |
| if (!std.ascii.eqlIgnoreCase(data[HEADER_LEN .. HEADER_LEN + q_end], q_buf[0..q_end])) return false; | ||
| if (readU16(data, HEADER_LEN + q_end) != qtype) return false; // QTYPE echoed | ||
| return true; |
| const existing = self.peers.get(peer.pubkey); | ||
| if (existing) |e| { | ||
| // Only update if the incoming info is newer (higher Lamport timestamp) | ||
| if (peer.lamport <= e.lamport) return; | ||
| try self.peers.put(peer.pubkey, peer); | ||
| return; | ||
| } |
| if (wg_dev.decryptTransport(pkt, &out_buf)) |dec| { | ||
| // Service filter: check port access before TUN write | ||
| if (lib.services.Policy.parseTransportHeader(out_buf[0..dec.len])) |ti| { | ||
| const peer = wg_dev.peers[dec.slot] orelse continue; | ||
| const org_pk = if (membership.peers.getPtr(peer.identity_key)) |mp| mp.org_pubkey else null; | ||
| if (!service_filter.check(peer.identity_key, org_pk, ti.proto, ti.dst_port)) continue; | ||
| // Service filter: check access before TUN write (IPv4 + IPv6, M5). | ||
| if (wg_dev.peers[dec.slot]) |peer| { | ||
| const org_pk = orgPubkeyLocked(membership, peer.identity_key); | ||
| if (!service_filter.allowPacket(peer.identity_key, org_pk, out_buf[0..dec.len])) continue; | ||
| } |
| if (wg_dev.decryptTransport(pkt, &decrypt_storage[n_decrypted.*])) |result| { | ||
| // Check service filter before buffering | ||
| const PolicyMod = lib.services.Policy; | ||
| if (PolicyMod.parseTransportHeader(decrypt_storage[n_decrypted.*][0..result.len])) |ti| { | ||
| if (wg_dev.peers[result.slot]) |peer| { | ||
| const org_pk = if (swim.membership.peers.getPtr(peer.identity_key)) |mp| mp.org_pubkey else null; | ||
| if (!service_filter.check(peer.identity_key, org_pk, ti.proto, ti.dst_port)) return; | ||
| } | ||
| // Check service filter before buffering (IPv4 + IPv6, M5). | ||
| if (wg_dev.peers[result.slot]) |peer| { | ||
| const org_pk = orgPubkeyLocked(swim.membership, peer.identity_key); | ||
| if (!service_filter.allowPacket(peer.identity_key, org_pk, decrypt_storage[n_decrypted.*][0..result.len])) return; | ||
| } |
Code-review follow-ups on the hardening PR: - P1 (regression): the H1 RX source-IP check rejected non-IP plaintext, which dropped every FFI framed tunnel message (framed with a leading-zero length header → version nibble 0). innerSourceAllowed now only constrains IPv4/IPv6 packets (the IP-routed ones); non-IP framed payloads pass to the inbox. - H6 race: the daemon re-read wg_dev.peers[slot] after decryptTransport had released the device lock — a data race vs removePeer that could also skip the service filter. decryptTransport now returns the peer identity_key (captured under the lock); all four daemon filter blocks and the FFI use it, so peers[] is never re-read post-decrypt. - P2: the FFI auto-register path charged the handshake rate limiter twice for one legitimate handshake (initial UnknownPeer probe + retry). Added handleInitiationAdmitted for the retry, which skips re-admission. - DNS: responseMatchesQuery now also validates the echoed QCLASS. - membership.upsert: free the previous name allocation on the update path (consistent with remove/deinit) to avoid a leak. - device.zig: make the handshake-limiter hash operands explicitly usize; give the initiation handlers a shared named return type (InitiationResult). Verified: zig build test, Debug, ReleaseFast, ReleaseSafe, and x86_64-linux cross-compile (Debug + ReleaseFast).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ab85b27106
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| wg_dev.lock.lockSharedUncancelable(zio()); | ||
| defer wg_dev.lock.unlockShared(zio()); |
There was a problem hiding this comment.
Release the device lock before the next batch
In this loop, defer is scoped to cryptoWorkerPipeline, not to the current iteration, so after the first popped batch each crypto worker keeps a shared wg_dev.lock held until the thread exits (including across continue and the next crypto_q.pop()). Any later exclusive operation on the same device lock, such as initiateHandshake, handleResponse, or removePeer, can then block indefinitely once data-plane traffic has passed through the worker, preventing rekeys/handshakes and peer cleanup in userspace mode.
Useful? React with 👍 / 👎.
Hardening pass over the networking stack (
src/net,src/nat,src/discovery,src/protocol,src/wireguard) addressing confirmed Critical/High issues plus two low-risk Mediums from an adversarial review. Every fix has a regression test. Green onzig build test, Debug, ReleaseFast (shipping config), ReleaseSafe, andzig fmt.What's hardened
Authentication & spoofing
dead/leave/suspectabout a third party now only marks it suspected locally and must be confirmed by our own failure detector before eviction; gossip can no longer clear local suspicion or resurrect a peer.Memory safety (the shipped ReleaseFast build runs with safety checks off)
mg://token).Resource exhaustion / DoS
Thread safety
RwLockon theWgDevicepeer table (exclusive for mutation/handshake, shared for the data plane) so peer removal can't zero key material mid-encrypt.RwLockon the membership table — single writer takes it exclusively, external readers (data plane, FFI) share it.Service policy
Tests
9 new regression tests covering each fix (membership cap/reclaim, expired-peer buffer, org-signature verification, gossip non-eviction, base64 bound, DNS response validation, IPv6 filter fail-closed, handshake rate limiter, RX source-IP enforcement).
Notes for review
🤖 Generated with Claude Code