Transparent gateway stall under high normal connection churn (tun + gvisor + fakeip + WG) before host saturation

### Version / Environment

- sing-box: 1.13.11
- OS: Ubuntu Linux x86_64
- Deployment role: transparent office gateway
- Config shape at incident time:
  - `tun`
  - `stack: gvisor`
  - `auto_route: true`
  - `mtu: 1280`
  - `fakeip` enabled
  - `sniff` enabled
  - log level `debug`
  - outbound split through `WG-HongKong` and nested `WG-Singapore`

### Problem

Under normal office traffic, users intermittently experience what looks like a "disconnect" from the gateway lasting several seconds.

The important part is that this does **not** look like:
- upstream packet loss
- sing-box restart/crash
- CPU saturation
- link saturation
- OOM

Instead, it looks like sing-box becomes internally congested first under a burst of **normal business connections**.

### Incident window

Observed around:

- `2026-05-18 14:07` to `14:10` (Asia/Shanghai, UTC+8)

### Business destination IPs involved

These are confirmed normal company business services:

- `106.75.138.153:443`
- `120.55.168.246:443`
- `106.75.169.134:443`

### Why this does not look like a simple host/network bottleneck

#### Upstream remained healthy
The path from this gateway to the upstream gateway stayed at `0% packet loss` throughout.

#### No service or kernel failure
During the incident window:
- `journalctl -k`: no NIC reset / link flap / kernel fault
- `journalctl -u sing-box`: no restart or crash

#### CPU was not saturated
At `2026-05-18 14:10:00`:
- all CPUs: about `22%`
- busiest relevant core: about `37%`

#### Memory increased sharply, but host did not OOM
Health log RSS trend:
- `14:07`: `4.33G`
- `14:08`: `1.69G`
- `14:09`: `1.92G`
- `14:10`: `5.09G`
- `14:11`: `6.85G`
- `14:12`: `7.20G`
- `14:13`: `8.05G`

But there was:
- no OOM kill
- no swap storm
- no reboot
- no sing-box restart
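
To make "increased sharply" concrete, the RSS samples above can be turned into per-minute deltas. This is a minimal sketch: the values are copied from the list above, the unit is taken as written (`G`), and the choice to measure growth from the lowest sample and the one-sample-per-minute spacing are assumptions made for illustration.

```python
from decimal import Decimal

# RSS samples copied from the health log excerpt above (unit "G" as reported).
samples = [
    ("14:07", "4.33"),
    ("14:08", "1.69"),
    ("14:09", "1.92"),
    ("14:10", "5.09"),
    ("14:11", "6.85"),
    ("14:12", "7.20"),
    ("14:13", "8.05"),
]
rss = [(t, Decimal(v)) for t, v in samples]

# Change in RSS relative to the previous sample.
deltas = [(rss[i][0], rss[i][1] - rss[i - 1][1]) for i in range(1, len(rss))]

# Growth from the lowest sample to the last one (assumes one sample per minute).
low_index = min(range(len(rss)), key=lambda i: rss[i][1])
growth = rss[-1][1] - rss[low_index][1]
minutes = len(rss) - 1 - low_index
per_minute = growth / minutes

for t, d in deltas:
    print(f"{t}: {d:+.2f}G vs previous minute")
print(f"growth since {rss[low_index][0]}: {growth}G over {minutes} min "
      f"({per_minute}G/min on average)")
```

The largest single-minute jump falls at `14:10`, the same minute in which the new inbound count drops.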

#### NIC bandwidth was not saturated
Interface utilization stayed far below the 1 Gbit/s link capacity.

### Evidence that sing-box fell behind on connection handling

Per-minute counters in the critical window:

- `14:07`: `inbound=6952`, `packet=972`, `sniff=948`, `finished=2883`
- `14:08`: `inbound=7622`, `packet=1086`, `sniff=973`, `finished=2651`
- `14:09`: `inbound=7246`, `packet=1091`, `sniff=977`, `finished=4020`
- `14:10`: `inbound=2924`, `packet=993`, `sniff=477`, `finished=2936`

This looks like:
- from `14:07` to `14:09`, new connections were arriving much faster than they were finishing
- at `14:10`, new inbound count dropped sharply
- later `14:11-14:13` looked like retry/reconnect amplification

This feels more like “new connection handling fell behind” than “traffic disappeared”.
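
The gap between arrivals and completions can be quantified from the counters above. This is a rough sketch only: it assumes `inbound` approximates newly accepted connections and `finished` approximates closed ones, so `inbound - finished` is treated as the net number of connections left open in that minute. If the counters mean something else, the numbers below do not apply.

```python
# Per-minute counters copied from the list above.
counters = {
    "14:07": {"inbound": 6952, "finished": 2883},
    "14:08": {"inbound": 7622, "finished": 2651},
    "14:09": {"inbound": 7246, "finished": 4020},
    "14:10": {"inbound": 2924, "finished": 2936},
}


def backlog_growth(counters):
    """Return (minute, net, cumulative) rows, where net = inbound - finished."""
    rows = []
    cumulative = 0
    for minute, c in counters.items():
        net = c["inbound"] - c["finished"]
        cumulative += net
        rows.append((minute, net, cumulative))
    return rows


rows = backlog_growth(counters)
for minute, net, cumulative in rows:
    print(f"{minute}: net={net:+d} cumulative={cumulative:+d}")
```

Under that assumption, roughly 12,000 more connections were opened than finished between `14:07` and `14:09`, and the net only returns to about zero at `14:10`, when the inbound count itself collapses.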

### Other notes

- FakeIP pool itself did **not** appear exhausted
- We saw a `missing fakeip record` symptom, but not pool depletion
- `tun0 qlen=500` did not look like the main cause
- The issue reproduced under **normal company traffic**, not obviously abnormal traffic

### Question

Is this a known weak path or expected limitation for deployments shaped like:

- `tun`
- `gvisor`
- `fakeip`
- `debug`
- transparent multi-user gateway
- WG-based outbound split

From an operator perspective, sing-box appears to become the bottleneck in its internal connection/state path **before** the host itself is actually saturated.

Are there known mitigations or recommended deployment changes for this scenario?

