Version / Environment
- sing-box: 1.13.11
- OS: Ubuntu Linux x86_64
- Deployment role: transparent office gateway
- Config shape at incident time:
  - tun
  - stack: gvisor
  - auto_route: true
  - mtu: 1280
  - fakeip enabled
  - sniff enabled
- log level: debug
- outbound split through WG-HongKong and nested WG-Singapore
Problem
Under normal office traffic, the gateway intermittently appears to “disconnect” for users for several seconds.
The important part is that this does not look like:
- upstream packet loss
- sing-box restart/crash
- CPU saturation
- link saturation
- OOM
Instead, it looks like sing-box becomes internally congested first under a burst of normal business connections.
Incident window
Observed around:
2026-05-18 14:07 to 14:10 (Asia/Shanghai, UTC+8)
Business destination IPs involved
These are confirmed normal company business services:
106.75.138.153:443
120.55.168.246:443
106.75.169.134:443
Why this does not look like a simple host/network bottleneck
Upstream remained healthy
Gateway -> upstream gateway stayed at 0% packet loss.
No service or kernel failure
During the incident window:
- journalctl -k: no NIC reset / link flap / kernel fault
- journalctl -u sing-box: no restart or crash
CPU was not saturated
At 2026-05-18 14:10:00:
- all CPUs: about 22%
- busiest relevant core: about 37%
Memory increased sharply, but host did not OOM
Health log RSS trend:
14:07: 4.33G
14:08: 1.69G
14:09: 1.92G
14:10: 5.09G
14:11: 6.85G
14:12: 7.20G
14:13: 8.05G
But there was:
- no OOM kill
- no swap storm
- no reboot
- no sing-box restart
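To make the memory trend easier to read, here is a small Python sketch that computes the minute-over-minute RSS change from the health log values above. The names `rss_samples` and `rss_deltas` are illustrative only and do not come from sing-box or our health tooling.

```python
# RSS samples copied from the health log above (values in G, as reported).
rss_samples = [
    ("14:07", 4.33),
    ("14:08", 1.69),
    ("14:09", 1.92),
    ("14:10", 5.09),
    ("14:11", 6.85),
    ("14:12", 7.20),
    ("14:13", 8.05),
]


def rss_deltas(samples):
    """Return (time, delta) pairs: RSS change versus the previous sample."""
    return [
        (cur_time, round(cur - prev, 2))
        for (_, prev), (cur_time, cur) in zip(samples, samples[1:])
    ]


deltas = rss_deltas(rss_samples)
for time, delta in deltas:
    print(f"{time}: {delta:+.2f}G")
```

By this arithmetic the largest single-minute jump is 14:09 to 14:10 (+3.17G), and RSS grows by about 6.36G between 14:08 and 14:13.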
NIC bandwidth was not saturated
Interface utilization stayed far below the 1 Gbit/s link capacity.
Evidence that sing-box fell behind on connection handling
Per-minute counters in the critical window:
14:07: inbound=6952, packet=972, sniff=948, finished=2883
14:08: inbound=7622, packet=1086, sniff=973, finished=2651
14:09: inbound=7246, packet=1091, sniff=977, finished=4020
14:10: inbound=2924, packet=993, sniff=477, finished=2936
This looks like:
- from 14:07 to 14:09, new connections were arriving much faster than they were finishing
- at 14:10, new inbound count dropped sharply
- later, 14:11-14:13 looked like retry/reconnect amplification
This feels more like “new connection handling fell behind” than “traffic disappeared”.
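As a rough check on the "fell behind" reading, the sketch below computes inbound minus finished per minute and a running total from the counters above. It assumes that `inbound` and `finished` count the same kind of connection, which we have not verified against the sing-box source. Long-lived connections also inflate the running total, so treat it as an estimate, not an exact in-flight count. The helper name `backlog` is made up for this example.

```python
# Per-minute counters copied from the critical window above:
# (time, inbound, finished)
counters = [
    ("14:07", 6952, 2883),
    ("14:08", 7622, 2651),
    ("14:09", 7246, 4020),
    ("14:10", 2924, 2936),
]


def backlog(rows):
    """Return (time, net, cumulative), where net = inbound - finished."""
    out = []
    total = 0
    for time, inbound, finished in rows:
        net = inbound - finished
        total += net
        out.append((time, net, total))
    return out


result = backlog(counters)
for time, net, total in result:
    print(f"{time}: net={net:+d} cumulative={total:+d}")
```

Under those assumptions, about 12,000 more connections were opened than finished between 14:07 and 14:09 (+4069, +4971, +3226). At 14:10 the two rates roughly balance (net -12), but only because inbound dropped.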
Other notes
- FakeIP pool itself did not appear exhausted
- We saw a missing fakeip record symptom, but not pool depletion
- tun0 qlen=500 did not look like the main cause
- The issue reproduced under normal company traffic, not obviously abnormal traffic
Question
Is this a known weak path or expected limitation for deployments shaped like:
- tun
- gvisor
- fakeip
- debug log level
- transparent multi-user gateway
- WG-based outbound split
From an operator perspective, sing-box appears to become the bottleneck in its internal connection/state path before the host itself is actually saturated.
Are there known mitigations or recommended deployment changes for this scenario?