A pure eBPF/XDP Carrier-Grade NAT (CGNAT) implementation with native hairpinning support.
Existing eBPF NAT implementations like einat-ebpf have limitations:
- **Hairpinning requires kernel hacks** - TC (Traffic Control) hooks process packets after the kernel routing decision. When a packet is destined for a local IP, Linux routes it via the `local` table directly to localhost, bypassing the network interface entirely. The eBPF program never sees these packets.
- **Workarounds are fragile** - The current solution involves policy-based routing manipulation:

  ```bash
  # Reprioritize routing tables
  ip rule add pref 200 lookup local
  ip rule del pref 0 lookup local
  # Force packets out the external interface
  ip rule add from <internal_subnet> lookup <custom_table>
  ```

  Plus manual ARP entries. This is kernel-dependent and error-prone.
- **Not 100% eBPF** - Relies on kernel conntrack and routing subsystems.
Build a CGNAT that is:
- **100% eBPF/XDP** - Bypass the kernel networking stack entirely
- **Native hairpinning** - Use `XDP_REDIRECT` to handle hairpin scenarios without routing hacks
- **High performance** - XDP processes packets before the kernel, achieving 10M+ pps
- **RFC compliant** - Follow NAT behavioral requirements
| Aspect | TC (Traffic Control) | XDP (eXpress Data Path) |
|---|---|---|
| Hook point | After routing decision | Before kernel sees packet |
| Hairpinning | Requires routing hacks | XDP_REDIRECT to any interface |
| Performance | ~2M pps | ~10M+ pps |
| Kernel bypass | Partial | Complete |
When Client A (10.0.0.1) wants to reach Client B (10.0.0.2) via the public IP (203.0.113.1:port):
```
┌─────────────────────────────────────────────────────────────────┐
│ XDP Program                                                      │
├─────────────────────────────────────────────────────────────────┤
│ 1. Packet arrives: src=10.0.0.1 dst=203.0.113.1:port            │
│ 2. Lookup: 203.0.113.1:port maps to internal 10.0.0.2:8080      │
│ 3. Rewrite: src=203.0.113.1 dst=10.0.0.2:8080                   │
│ 4. XDP_REDIRECT → internal interface RX queue                    │
│                                                                  │
│ Kernel routing stack: NEVER INVOLVED                             │
└─────────────────────────────────────────────────────────────────┘
```
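The hairpin path reduces to a destination-IP check followed by `XDP_REDIRECT`. Below is a minimal sketch in the style of an aya-ebpf XDP program; it is illustrative only: the EtherType check, L4 parsing, NAT map lookups, header rewrites, and checksum updates are elided, and the interface index would be resolved at load time rather than hard-coded.

```rust
// Sketch of hairpin detection + redirect (aya-ebpf style). Illustrative:
// real code also parses L4, consults the NAT maps, rewrites headers, and
// fixes checksums before redirecting.
use aya_ebpf::{bindings::xdp_action, helpers::bpf_redirect, macros::xdp, programs::XdpContext};

const ETH_HDR_LEN: usize = 14;
const EXTERNAL_IP: u32 = u32::from_be_bytes([203, 0, 113, 1]);
const INTERNAL_IFINDEX: u32 = 3; // placeholder; resolved at load time

#[repr(C)]
struct Ipv4Hdr {
    ver_ihl: u8,
    tos: u8,
    tot_len: u16,
    id: u16,
    frag_off: u16,
    ttl: u8,
    proto: u8,
    check: u16,
    src_addr: u32, // network byte order
    dst_addr: u32, // network byte order
}

#[xdp]
pub fn cgnat(ctx: XdpContext) -> u32 {
    try_cgnat(&ctx).unwrap_or(xdp_action::XDP_PASS)
}

fn try_cgnat(ctx: &XdpContext) -> Result<u32, ()> {
    let ipv4 = ptr_at::<Ipv4Hdr>(ctx, ETH_HDR_LEN)?;
    let dst = u32::from_be(unsafe { (*ipv4).dst_addr });

    if dst == EXTERNAL_IP {
        // Hairpin: an internal client addressed the public IP. After
        // rewriting src/dst (elided), bounce the frame straight back
        // out the internal interface; the kernel never routes it.
        return Ok(unsafe { bpf_redirect(INTERNAL_IFINDEX, 0) } as u32);
    }
    Ok(xdp_action::XDP_PASS) // normal SNAT/DNAT path (elided)
}

// Bounds check the BPF verifier demands before any packet access.
fn ptr_at<T>(ctx: &XdpContext, offset: usize) -> Result<*const T, ()> {
    let (start, end) = (ctx.data(), ctx.data_end());
    if start + offset + core::mem::size_of::<T>() > end {
        return Err(());
    }
    Ok((start + offset) as *const T)
}
```

Without the `ptr_at` bounds check, the verifier rejects the program at load time.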
Implement stateful connection tracking entirely in eBPF maps:
```
┌────────────────────┐      ┌────────────────────┐
│  NAT Binding Map   │      │  Connection Table  │
├────────────────────┤      ├────────────────────┤
│ internal_ip:port   │────▶│ state (NEW/EST/FIN)│
│ external_ip:port   │      │ timeout            │
│ protocol           │      │ packet/byte counts │
└────────────────────┘      └────────────────────┘
```
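A sketch of how these maps might be declared with aya-ebpf. The fields mirror the diagram, but the exact layouts, map sizes, and names are assumptions, not the project's actual definitions:

```rust
use aya_ebpf::{macros::map, maps::HashMap};

// Keyed by (ip, port, protocol). Used in both directions: internal
// 3-tuple -> external 3-tuple for SNAT, with a mirror map (elided)
// keyed by the external tuple for inbound DNAT.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct FlowKey {
    pub addr: u32,  // IPv4 address, network byte order
    pub port: u16,  // network byte order
    pub proto: u8,  // IPPROTO_TCP / IPPROTO_UDP / IPPROTO_ICMP
    pub _pad: u8,
}

#[repr(C)]
#[derive(Clone, Copy)]
pub struct ConnState {
    pub state: u8,       // NEW / ESTABLISHED / FIN
    pub _pad: [u8; 7],
    pub expires_ns: u64, // absolute deadline based on bpf_ktime_get_ns()
    pub packets: u64,
    pub bytes: u64,
}

#[map]
static NAT_BINDINGS: HashMap<FlowKey, FlowKey> = HashMap::with_max_entries(1 << 20, 0);

#[map]
static CONN_TABLE: HashMap<FlowKey, ConnState> = HashMap::with_max_entries(1 << 20, 0);
```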
- **RFC 5508** - NAT Behavioral Requirements for ICMP
  - ICMP Query session handling
  - ICMP Error forwarding with embedded payload translation
  - Hairpinning requirements for ICMP
- **RFC 7857** - Updates to NAT Behavioral Requirements
  - Endpoint-Independent Mapping (EIM)
  - Endpoint-Independent Filtering (EIF)
  - Address pooling requirements
  - Port allocation recommendations
- **RFC 4787** - NAT Behavioral Requirements for UDP
- **RFC 5382** - NAT Behavioral Requirements for TCP
- **RFC 6146** - Stateful NAT64 (future consideration)
- **RFC 6888** - Common Requirements for CGNAT
- XDP program skeleton with interface attachment
- Basic packet parsing (Ethernet, IP, TCP/UDP)
- NAT binding map structure
- Outbound SNAT (source NAT)
- Inbound DNAT (destination NAT)
- Detect hairpin scenarios (dst matches external IP)
- Implement `XDP_REDIRECT` for hairpin packets
- Handle both directions of hairpin flows
- Stateful connection table in eBPF maps
- TCP state machine tracking (SYN, ESTABLISHED, FIN, etc.)
- UDP timeout handling
- ICMP session tracking
- ICMP Query mapping (echo request/reply)
- ICMP Error translation (rewrite embedded headers)
- ICMP hairpinning
- Port allocation in eBPF with atomic counter
- Per-CPU statistics collection
- Incremental checksum updates (RFC 1624; see the sketch after this list)
- Endpoint-Independent Mapping/Filtering modes (future)
- Binding expiration/cleanup (future)
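Of these, the RFC 1624 item is compact enough to show inline: when a 16-bit field changes from m to m', the new checksum is HC' = ~(~HC + ~m + m') in one's-complement arithmetic, so the NAT never rescans the packet. A minimal sketch in plain Rust (no_std-friendly, so the same helper can sit in the eBPF program; the name is illustrative):

```rust
/// Incrementally update an Internet checksum after a 16-bit field changes,
/// per RFC 1624 (eqn. 3): HC' = ~(~HC + ~m + m').
/// `check` is the old header checksum; `old`/`new` are the field before/after.
pub fn csum_update16(check: u16, old: u16, new: u16) -> u16 {
    let mut sum = (!check as u32) + (!old as u32) + (new as u32);
    // Fold carries back into the low 16 bits (end-around carry).
    while sum >> 16 != 0 {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}
```

A 32-bit address rewrite is two applications of the helper, one per 16-bit half, and the same routine serves the IPv4 header checksum and, through the pseudo-header, the TCP/UDP checksums.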
```
cgnat-ebpf/
├── cgnat-common/   # Shared types between userspace and eBPF
├── cgnat-ebpf/     # XDP eBPF program (compiled to BPF bytecode)
├── cgnat/          # Userspace loader and CLI
├── Makefile        # Build automation
└── README.md
```
- Linux kernel 5.15+ (for BPF features)
- Rust nightly toolchain
- bpf-linker
- clang/llvm (for BPF compilation)
```bash
# Install Rust nightly and dependencies
make deps

# Or manually:
rustup install nightly
rustup component add rust-src --toolchain nightly
cargo install bpf-linker
```

```bash
# Build everything (eBPF + userspace)
make build

# Debug build
make debug

# Build only eBPF program
make build-ebpf

# Build only userspace
make build-user
```

```bash
# Run with sudo (XDP requires CAP_NET_ADMIN)
sudo ./target/release/cgnat \
-e eth0 \ # External interface
-i eth1 \ # Internal interface
-E 203.0.113.1 \ # External (public) IP
-I 10.0.0.0/8 # Internal subnet
# Or use make
make run ARGS="-e eth0 -i eth1 -E 203.0.113.1 -I 10.0.0.0/8"
```
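Under the hood, the CLI is a thin loader built on Aya. A rough sketch of the attach path using Aya's standard XDP API; the program name, object path, and error handling are simplified assumptions:

```rust
// Userspace loader sketch built on Aya. CLI parsing (-e/-i/-E/-I) and
// config-map population are trimmed; paths and names are illustrative.
use aya::{include_bytes_aligned, programs::{Xdp, XdpFlags}, Ebpf};

fn main() -> Result<(), anyhow::Error> {
    // Embed the BPF bytecode produced by `make build-ebpf`.
    let mut ebpf = Ebpf::load(include_bytes_aligned!(
        "../../target/bpfel-unknown-none/release/cgnat-ebpf"
    ))?;

    // Load the XDP program and attach it to both interfaces.
    let prog: &mut Xdp = ebpf.program_mut("cgnat").unwrap().try_into()?;
    prog.load()?;
    for iface in ["eth1", "eth0"] {
        // Default flags let the kernel pick the best mode (native driver
        // XDP where available); the benchmarks' --skb-mode corresponds
        // to XdpFlags::SKB_MODE instead.
        prog.attach(iface, XdpFlags::default())?;
    }
    Ok(())
}
```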
```bash
# TODO: Network namespace based tests
# Will create isolated test environments with veth pairs
```

- einat-ebpf - Reference implementation (limitations documented above)
- einat-ebpf Issue #4 - Hairpinning routing problem
- Aya - Rust eBPF framework
- XDP Tutorial - Learning XDP
MIT OR Apache-2.0
| Project | Organization | Scale | Notes |
|---|---|---|---|
| Katran | Meta/Facebook | Millions of connections | L4 load balancer with XDP, handles Facebook's traffic |
| Cilium | Isovalent/Cisco | Kubernetes clusters | Full NAT in eBPF, replaces kube-proxy + iptables |
| einat-ebpf | Open source | Home/small ISP | Full Cone NAT, but has hairpinning limitations (uses TC hooks) |
| eBPF BNG | Open source | ISP edge (OLT) | Includes NAT44/CGNAT module, proposed as future of ISP edge |
Most production CGNAT deployments use:
- Dedicated appliances: A10, F5, Juniper ($50K-$500K)
- DPDK-based solutions: VPP, custom implementations (100+ Gbps)
- Kernel netfilter: iptables/nftables with conntrack (simplest but slowest)
| Approach | Packets/sec (per core) | Latency | Source |
|---|---|---|---|
| iptables/nftables | ~1-2M pps | ~10-50μs | Industry benchmarks |
| XDP | 10-26M pps | <1μs | Cloudflare |
| DPDK | 20-40M pps | <1μs | Various |
| Hardware appliance | Line rate | <1μs | Vendor specs |
`tests/bench_compare.sh` was run in cgnat, iptables, and nftables modes on the namespace/veth testbed with `--skb-mode` and offloads disabled (`BENCH_DISABLE_OFFLOADS=1`). 3-run mean results:
| Mode | TCP Throughput (Mbps) | UDP Throughput (Mbps) | TCP Connect Rate (cps) |
|---|---|---|---|
| cgnat | 2980.9 | 1229.5 | 12162.3 |
| iptables | 2762.9 | 1140.4 | 10710.3 |
| nftables | 2774.6 | 1141.2 | 12574.3 |
Observed delta (mean):
- cgnat TCP throughput vs iptables: +7.9%
- cgnat TCP throughput vs nftables: +7.4%
- cgnat UDP throughput vs iptables/nftables: +~7.8%
Notes:
- These numbers are useful for regression tracking and MVP signal.
- This environment is generic XDP (`skb` mode) on a virtualized setup, not native driver XDP on physical NICs.
- Do not present these as production line-rate claims until validated on target hardware.
Reproduce:

```bash
sudo env PING_COUNT=20 TCP_DURATION=5 UDP_DURATION=5 CONNECT_ATTEMPTS=300 \
    BENCH_DISABLE_OFFLOADS=1 ./tests/bench_compare.sh --modes cgnat,iptables,nftables
```

Traditional iptables path:
```
NIC → Driver → sk_buff allocation → netfilter hooks → conntrack → NAT → routing → output
```

XDP path:

```
NIC → Driver → XDP program (NAT here) → redirect/TX
                    ↑
       No sk_buff, no conntrack, no routing stack
```
From Cilium's documentation:

> XDP hooks into a very early ingress path at the driver layer, where it operates with direct access to the packet's DMA buffer. This is effectively as low-level as it can get.
| Solution | Cost | Throughput |
|---|---|---|
| A10 Thunder CGN | $100K-$500K | 100 Gbps |
| Juniper MX CGNAT | $50K-$200K | 40 Gbps |
| Commodity server + XDP | $5K-$15K | 40-100 Gbps |
| Feature | This Project | Production-Ready |
|---|---|---|
| Basic SNAT/DNAT | ✅ | ✅ |
| Hairpinning (XDP_REDIRECT) | ✅ | ✅ |
| Port allocation (eBPF atomic) | ✅ | ✅ |
| Checksums (RFC 1624) | ✅ | ✅ |
| ICMP translation (RFC 5508) | ✅ | ✅ |
| Connection tracking | ✅ (basic) | Needs timeout/cleanup |
| Logging (RFC 6888) | ❌ | Required for ISPs |
| Port Block Allocation | ❌ | Reduces logging overhead |
| Multiple external IPs | ❌ | Required at scale |
| ALGs (FTP, SIP) | ❌ | Sometimes needed |
| HA/Failover | ❌ | Critical for production |
- Maturity: DPDK and hardware appliances have 10+ years of production hardening
- Features: Full RFC compliance (logging, port block allocation, ALGs) is complex
- Expertise: eBPF development requires specialized skills
- Support: Vendors provide 24/7 support; open source doesn't
From the eBPF BNG article:

> For edge deployment (10-40 Gbps per OLT), eBPF/XDP is simpler and sufficient... This is the future of ISP edge infrastructure.
The industry is moving toward eBPF/XDP for:
- Edge/access networks: Where cost matters more than peak performance
- Cloud-native: Kubernetes, containers (Cilium dominates here)
- DDoS mitigation: XDP's speed is unmatched for packet filtering
- Cloudflare - How to drop 10 million packets per second
- Cilium BPF/XDP Reference Guide
- einat-ebpf - eBPF Full Cone NAT
- eBPF BNG - Killing the ISP Appliance
- iptables vs eBPF - Why Kubernetes is Moving On
- Tigera - eBPF: When and When Not to Use It
IPv4 exhaustion is complete — all five Regional Internet Registries have depleted their free pools. Over 17% of eyeball ASes and 90%+ of cellular ASes now rely on CGNAT (Cloudflare 2024 research). There are ~16,870 ISPs worldwide, and CGNAT is a must-have, not a nice-to-have.
Cost arbitrage drives adoption:
| Approach | Cost per 10K subscribers |
|---|---|
| Buy IPv4 addresses ($15–52/IP) | ~$250,000 |
| Hardware CGNAT (A10 Thunder) | $63,000–$445,000 |
| Software CGNAT on commodity x86 | ~$10,000–$25,000 |
Software-defined CGNAT on commodity hardware represents a 10–25x cost reduction vs. dedicated appliances.
| Company | Technology | Outcome |
|---|---|---|
| Isovalent (Cilium) | eBPF/XDP | Acquired by Cisco for ~$650M (32x ARR), raised $69M total |
| Tigera (Calico) | eBPF dataplane | $43M raised, 8M+ nodes/day |
| NFWare | VPP/DPDK vCGNAT | $3.9M raised, 100+ ISP customers |
| Groundcover | eBPF observability | $60M raised through Series B |
NFWare is the closest comp — they validated the software CGNAT market with $3.9M in funding and bootstrapped to 100+ ISP customers. Their approach uses VPP/DPDK (kernel bypass). Our eBPF/XDP approach stays in-kernel, which is architecturally simpler and aligns with the direction Cilium proved at scale.
What works (strong for seed stage):
- Full SNAT/DNAT/hairpin via pure XDP — no kernel routing hacks
- Stateful TCP/UDP/ICMP connection tracking in eBPF maps
- A/B benchmark suite proving parity or better vs. iptables/nftables on equal footing
- RFC 5508 (ICMP), RFC 1624 (checksums) compliance
What's needed before raising:
- Bare-metal benchmarks on real NICs (ConnectX-5 or E810) — the veth/SKB numbers (3 Gbps) are valid for regression testing but don't show XDP's true capability. On real hardware, expect 10–40 Gbps/server (matching a $63K appliance on a $5K server).
- One ISP design partner or LOI — every funded company in this space had a named customer at seed (NFWare had Telefonica, RtBrick had Deutsche Telekom, DriveNets had AT&T).
What can wait (build with funding):
- Multi-IP pools, Port Block Allocation, RFC 6888 logging, HA/failover
- These are expected gaps at seed stage
The veth/SKB benchmark environment uses generic XDP — the slowest execution mode. On real hardware with native XDP:
| Config | Throughput | Source |
|---|---|---|
| This PoC (veth/SKB mode) | 3 Gbps | Measured |
| XDP native, single core, ConnectX-5 | 8–10 Mpps (~30–40 Gbps) | CoNEXT 2018 XDP paper |
| XDP redirect, multi-core | 80–100+ Mpps | Mellanox mlx5 benchmarks |
| NFWare vCGNAT (VPP, x86) | 231 Gbps | Intel builder report |
XDP achieves ~80% of DPDK throughput while staying fully in-kernel — no dedicated cores, no kernel bypass, simpler operations model.
Based on comps: $2M–$4M pre-seed/seed with bare-metal validation and one ISP pilot. Capital-efficient path modeled on NFWare ($3.9M total → 100+ customers).
The veth/SKB benchmarks prove correctness and relative advantage. To generate investor-ready numbers (10–40 Gbps), we need native XDP on real NICs.
Cloud (cheapest, fastest to set up):
- Hetzner dedicated (~€40–60/month) — Intel X710 (i40e driver), full native XDP. Best value.
- AWS c5n.xlarge (~$0.50–1.00/hr) — ENA driver supports XDP native mode. Two instances in same placement group.
- GCP c2-standard-8 — gVNIC supports XDP.
Bare metal (best numbers):
- Any machine with two physical NICs that support XDP native mode
- Supported NICs: Intel i40e (X710), Intel ice (E810), Mellanox mlx5 (ConnectX-5/6)
```
Machine A (traffic gen)      Machine B (CGNAT)             Machine C (server)
10.0.0.1/24                  10.0.0.254 (internal)         203.0.113.254
iperf3 client ── NIC ───── NIC1        NIC2 ── NIC ─────── iperf3 server
                             203.0.113.1 (external)
```
Or two machines with Machine B having two NICs (internal + external).
```bash
# Native mode (no --skb-mode flag)
sudo ./target/release/cgnat run \
    -e eth1 -i eth0 -E 203.0.113.1 -I 10.0.0.0/24
```

| Metric | Target | Would prove |
|---|---|---|
| TCP throughput (single core) | 10+ Gbps | Matches $63K A10 appliance |
| TCP throughput (multi-core) | 30–40 Gbps | Matches $200K+ appliance |
| Packets per second | 5+ Mpps | XDP advantage over iptables |
| Retransmits | ~0 | No packet corruption |
These numbers on a $5K server vs. a $63K appliance are the pitch slide.
- Bare-metal benchmark on native XDP with real NICs
- Binding expiration and cleanup (userspace timer + eBPF map iteration)
- Multiple external IP address pool support
- Port Block Allocation (PBA) per RFC 7422 to reduce logging
- Logging infrastructure for compliance (RFC 6888)
- Endpoint-Independent Mapping/Filtering mode configuration
- Performance benchmarking suite (`tests/bench_compare.sh`)
- HA/failover with state synchronization