Transparent gateway stall under high normal connection churn (tun + gvisor + fakeip + WG) before host saturation #4144

@richfang

Version / Environment

  • sing-box: 1.13.11
  • OS: Ubuntu Linux x86_64
  • Deployment role: transparent office gateway
  • Config shape at incident time:
    • tun
    • stack: gvisor
    • auto_route: true
    • mtu: 1280
    • fakeip enabled
    • sniff enabled
    • log level debug
    • outbound split through WG-HongKong and nested WG-Singapore

Problem

Under normal office traffic, the gateway intermittently appears to “disconnect” for users for several seconds.

The important part is that this does not look like:

  • upstream packet loss
  • sing-box restart/crash
  • CPU saturation
  • link saturation
  • OOM

Instead, it looks like sing-box becomes internally congested first under a burst of normal business connections.

Incident window

Observed around:

  • 2026-05-18 14:07 to 14:10 (Asia/Shanghai, UTC+8)

Business destination IPs involved

These are confirmed normal company business services:

  • 106.75.138.153:443
  • 120.55.168.246:443
  • 106.75.169.134:443

Why this does not look like a simple host/network bottleneck

Upstream remained healthy

The path from this gateway to the upstream gateway stayed at 0% packet loss throughout the incident window.

No service or kernel failure

During the incident window:

  • journalctl -k: no NIC reset / link flap / kernel fault
  • journalctl -u sing-box: no restart or crash

CPU was not saturated

At 2026-05-18 14:10:00:

  • all CPUs: about 22%
  • busiest relevant core: about 37%

Memory increased sharply, but host did not OOM

Health log RSS trend:

  • 14:07: 4.33G
  • 14:08: 1.69G
  • 14:09: 1.92G
  • 14:10: 5.09G
  • 14:11: 6.85G
  • 14:12: 7.20G
  • 14:13: 8.05G

But there was:

  • no OOM kill
  • no swap storm
  • no reboot
  • no sing-box restart
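To make the memory trend easier to read, the RSS samples above can be turned into minute-over-minute deltas. This is only a sketch: the values are copied from the health log as reported (unit "G" as logged), and `RSS_SAMPLES` and `rss_deltas` are illustrative names, not part of sing-box or the health-log tooling.

```python
# RSS samples copied from the health log in this report: (minute, RSS in "G").
RSS_SAMPLES = [
    ("14:07", 4.33),
    ("14:08", 1.69),
    ("14:09", 1.92),
    ("14:10", 5.09),
    ("14:11", 6.85),
    ("14:12", 7.20),
    ("14:13", 8.05),
]


def rss_deltas(samples):
    """Return (minute, delta) pairs: change in RSS versus the previous sample."""
    return [
        (cur_minute, round(cur - prev, 2))
        for (_, prev), (cur_minute, cur) in zip(samples, samples[1:])
    ]


for minute, delta in rss_deltas(RSS_SAMPLES):
    print(f"{minute}  delta={delta:+.2f}G")
```

On these samples the largest single-minute jump is +3.17G at 14:10, the same minute in which the inbound counter dropped.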

NIC bandwidth was not saturated

Interface utilization stayed far below the 1 Gbps line rate.

Evidence that sing-box fell behind on connection handling

Per-minute counters in the critical window:

  • 14:07: inbound=6952, packet=972, sniff=948, finished=2883
  • 14:08: inbound=7622, packet=1086, sniff=973, finished=2651
  • 14:09: inbound=7246, packet=1091, sniff=977, finished=4020
  • 14:10: inbound=2924, packet=993, sniff=477, finished=2936

This looks like:

  • from 14:07 to 14:09, new connections were arriving much faster than they were finishing (inbound was roughly 1.8x to 2.9x finished in each minute)
  • at 14:10, the new inbound count dropped sharply, from 7246 to 2924
  • later, 14:11 to 14:13 looked like retry/reconnect amplification

This feels more like “new connection handling fell behind” than “traffic disappeared”.
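The gap between arrivals and completions can be made explicit with a little arithmetic on the counters above. This sketch assumes that each `inbound` count is one new connection and each `finished` count is one completed connection, which is my reading of the counters and not something sing-box guarantees. `COUNTERS` and `backlog_by_minute` are illustrative names.

```python
# Per-minute counters copied from this report: (minute, inbound, finished).
COUNTERS = [
    ("14:07", 6952, 2883),
    ("14:08", 7622, 2651),
    ("14:09", 7246, 4020),
    ("14:10", 2924, 2936),
]


def backlog_by_minute(counters):
    """Return (minute, net, cumulative) rows, where net = inbound - finished."""
    rows = []
    cumulative = 0
    for minute, inbound, finished in counters:
        net = inbound - finished      # connections opened but not yet finished
        cumulative += net             # running total since the first sample
        rows.append((minute, net, cumulative))
    return rows


for minute, net, cumulative in backlog_by_minute(COUNTERS):
    print(f"{minute}  net={net:+6d}  cumulative={cumulative:6d}")
```

Under that assumption, about 12,000 more connections were opened than finished between 14:07 and 14:10. Long-lived connections legitimately stay open across minutes, so this is an upper-bound hint at a growing backlog, not proof of one.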

Other notes

  • FakeIP pool itself did not appear exhausted
  • We saw a "missing fakeip record" symptom, but no sign of pool depletion
  • tun0 qlen=500 did not look like the main cause
  • The issue reproduced under normal company traffic, not obviously abnormal traffic

Question

Is this a known weak path or expected limitation for deployments shaped like:

  • tun
  • gvisor
  • fakeip
  • debug log level
  • transparent multi-user gateway
  • WG-based outbound split

From an operator perspective, sing-box appears to become the bottleneck in its internal connection/state path before the host itself is actually saturated.

Are there known mitigations or recommended deployment changes for this scenario?

Labels: not following template (necessary information is not provided or is incomplete)
