
srv_prepare: bump conntrack + SYN backlog for VPN workload#41

Merged
findias merged 1 commit into main from fix/srv-prepare-conntrack-tuning
May 7, 2026


@findias findias commented May 6, 2026

Why

Default Ubuntu ships nf_conntrack_max=8192 — adequate for a desktop, dangerously small for a VPN aggregation point. EU production was silently dropping packets:

$ dmesg | grep nf_conntrack
nf_conntrack: table full, dropping packet  (×many, recent)

$ cat /proc/net/stat/nf_conntrack | awk 'NR==2 {print "drops:" $11 " inserts_failed:" $10}'
drops: 55667    inserts_failed: 29371
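
The column positions in that file are easy to miscount, and its counters are printed per CPU in hexadecimal. A minimal sketch of reading them by column name instead, assuming the first line of the file is the header row; the function name is hypothetical:

```python
def sum_conntrack_stats(text):
    """Sum per-CPU counters from /proc/net/stat/nf_conntrack by column name.

    The first line is treated as the header; each later line is one CPU.
    Note: the "entries" column repeats the global count on every row,
    so its sum is not meaningful. Only the event counters should be summed.
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    totals = dict.fromkeys(header, 0)
    for row in rows:
        for name, value in zip(header, row):
            totals[name] += int(value, 16)  # counters are printed in hex
    return totals
```

On a host, something like `sum_conntrack_stats(open("/proc/net/stat/nf_conntrack").read())["insert_failed"]` would give the total across CPUs rather than the value for the first CPU only.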

This compounds with the default nf_conntrack_tcp_timeout_established=432000 (5 days), which keeps conntrack slots occupied long after clients disconnect.

Symptom: TLS handshakes from any source IP not already warm in the conntrack table fail intermittently. Discovered while bringing vm_my_ru2 online — new source IP, no warm entries, handshakes broke.

Fix

Adds four entries to srv_prepare_bbr_settings (the dict the role already loops through):

net.netfilter.nf_conntrack_max: 131072
net.netfilter.nf_conntrack_buckets: 131072
net.netfilter.nf_conntrack_tcp_timeout_established: 3600
net.ipv4.tcp_max_syn_backlog: 4096

Memory cost: ~50 MB on a 1vCPU/1GB box. Acceptable.

Status

  • Already live-patched on vm_my_srv via sysctl -w and /etc/sysctl.d/99-vpn-tuning.conf, so relief does not depend on this PR.
  • This PR is the codification — every future srv_prepare apply will write the same values into /etc/sysctl.conf, so a fresh box gets them too.
  • All hosts in groups['cloud'] and groups['ru'] will receive the new values on next role apply.

Test plan

  • YAML validates (python3 -c "import yaml; yaml.safe_load(open(..))")
  • dmesg shows no new conntrack-table-full entries since the live patch (~30 min ago).
  • Reviewer note: this does NOT close the separate "TLS handshake from vm_my_ru2 fails on 3/4 v2 Reality SNIs" issue. That bug has a different root cause: TCP-level connectivity is 100% and conntrack pressure is gone, but the handshake still stalls at the TLS layer. Tracked separately; needs tcpdump investigation on EU :443.
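
A possible extra check for the test plan is to compare the live kernel values against the ones this PR sets. The sketch below assumes each sysctl key maps to a file under /proc/sys by replacing dots with slashes, which holds for these four keys; `EXPECTED`, `sysctl_path` and `find_mismatches` are hypothetical names, not part of the role:

```python
from pathlib import Path

# Values copied from this PR; the dict name is illustrative only.
EXPECTED = {
    "net.netfilter.nf_conntrack_max": 131072,
    "net.netfilter.nf_conntrack_buckets": 131072,
    "net.netfilter.nf_conntrack_tcp_timeout_established": 3600,
    "net.ipv4.tcp_max_syn_backlog": 4096,
}


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file, e.g. net.ipv4.x -> root/net/ipv4/x."""
    return Path(root, *key.split("."))


def find_mismatches(expected, root="/proc/sys"):
    """Return {key: (wanted, found)} for every key whose live value differs.

    A missing or unreadable file (for example, nf_conntrack not loaded)
    is reported with found=None.
    """
    mismatches = {}
    for key, want in expected.items():
        try:
            got = int(sysctl_path(key, root).read_text().strip())
        except (OSError, ValueError):
            got = None
        if got != want:
            mismatches[key] = (want, got)
    return mismatches
```

An empty result from `find_mismatches(EXPECTED)` on a host would indicate that the role apply (or the live patch) took effect there.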

Closes a latent capacity bug that would have hit any rapidly growing user base, with or without multi-RU.

Default Ubuntu kernel ships nf_conntrack_max=8192 — adequate for a
desktop, dangerously small for a VPN aggregation point handling
hundreds of concurrent VLESS+Reality, XHTTP/H2, and probe flows.
Symptom on EU production: dmesg fills with "nf_conntrack: table
full, dropping packet"; new flows from any source IP not already
warm in the table fail their TLS handshake. Compounds with the
default tcp_timeout_established=432000 (5 days) which keeps slots
occupied long after the client disconnected.
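
A rough way to see why the table size and the timeout compound: if an abandoned flow never sends FIN/RST and so holds its slot for the full established timeout, the table can absorb at most table_size / timeout such flows per second before it fills. The sketch below is a worst-case estimate under that assumption only (it ignores flows that close cleanly and any eviction); the function name is hypothetical:

```python
def max_abandoned_flow_rate(table_size, timeout_s):
    """Worst-case abandoned flows per second the table can absorb.

    Assumes every such flow occupies one slot for the full timeout.
    """
    return table_size / timeout_s


before = max_abandoned_flow_rate(8192, 432000)   # values quoted above
after = max_abandoned_flow_rate(131072, 3600)    # values set by this commit

print(f"before: ~{before * 3600:.0f} abandoned flows per hour")
print(f"after:  ~{after * 3600:.0f} abandoned flows per hour")
```

Under these assumptions the old settings tolerate only tens of abandoned flows per hour, while the new ones tolerate a full table's worth every hour.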

Live cumulative counters on EU before the fix:
  insert_failed: 29371
  drop:          55667
  early_drop:    16

Settings added to srv_prepare_bbr_settings (sysctl applied via
/etc/sysctl.conf):

  net.netfilter.nf_conntrack_max:                          131072
  net.netfilter.nf_conntrack_buckets:                      131072
  net.netfilter.nf_conntrack_tcp_timeout_established:        3600
  net.ipv4.tcp_max_syn_backlog:                              4096

Memory cost: ~50 MB on a 1vCPU/1GB box (131072 entries × ~376 B).
Acceptable.
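
The arithmetic behind that figure, as a small sketch. The 376 B per entry is the estimate quoted above; treating "1GB" as 1 GiB is an assumption, and the result is an upper bound that applies only when the table is completely full:

```python
ENTRIES = 131072
BYTES_PER_ENTRY = 376        # per-entry estimate quoted in this commit message
TOTAL_RAM = 1 * 1024**3      # assumption: "1GB" means 1 GiB


def conntrack_memory_bytes(entries, bytes_per_entry):
    """Upper bound on memory used by conntrack entries when the table is full."""
    return entries * bytes_per_entry


total = conntrack_memory_bytes(ENTRIES, BYTES_PER_ENTRY)
print(f"{total} bytes = {total / 1e6:.1f} MB = {total / 2**20:.1f} MiB")
print(f"{100 * total / TOTAL_RAM:.1f}% of RAM when the table is full")
```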

The same defaults belong on every host in groups['cloud'] and
groups['ru'] — srv_prepare runs on all of them. Hosts will pick up
the new values on the next role apply; no service restart needed
beyond the sysctl reload that role already triggers.

Live patch already applied to vm_my_srv via direct `sysctl -w` plus
/etc/sysctl.d/99-vpn-tuning.conf. The latter becomes redundant once
this role runs and writes /etc/sysctl.conf with the same values;
it stays in place temporarily until the next deploy normalises state.

Note: this fix did NOT resolve the separate TLS-handshake-fails-
from-vm_my_ru2 issue (3 of 4 v2 Reality SNIs). That bug has a
different root cause and will be investigated separately (likely
involving tcpdump on EU :443 during a fresh ru2-source handshake).

Signed-off-by: findias <findias@gmail.com>
@findias findias merged commit d8b735a into main May 7, 2026
2 checks passed