conn, tun: replace pre-made buffers with an allocator pattern #49
buffer/buffer.go
// The returned Data slice must not be retained past Release.
type Buffer struct {
	data    []byte
	recycle func(*Buffer)
Currently, recycle is always nil.
I'd expect that different buffer.Source implementations would want to use different recycle implementations (e.g., tsvnic vs rioconn vs sync.Pool).
What's the plan here?
Currently, buffer.FragmentPool sets the unexported recycle field directly, but external packages won't be able to do that, so I'm wondering what the API surface should look like.
Or is the expectation that all allocators be implemented in this package? We can explore that for rioconn, but perhaps not for tsvnic. This is what we currently have for rioconn: https://github.com/tailscale/tailscale/blob/nickkhyl/rioconn/net/rioconn/request_ring.go
Does either of them need more than a uint64 of metadata? Per my original message, I considered just packing a metadata field into every buffer. And yes, making them all public.
If we need more, interfaces become an appealing option.
yeah, I'd expect uint64 (or perhaps even uintptr) to be sufficient.
We can iterate on the specifics later, since it seems we're on the same page about the API, and there's also no expectation that all implementations should live in wireguard-go/buffer.
After some deliberation, I opted to instead embed an interface -- I/O implementations may stuff it with arbitrary data as needed. LMK what you think.
buffer/buffer.go
package buffer

const (
	MaxMessageSize = (1 << 16) - 1 // largest possible UDP datagram
Does this relate to a similar constant defined in device/queueconstants*, or should it be independent?
tun/tun_windows.go
switch err {
case nil:
	if bufs[0] == nil {
		bufs[0] = buffer.New(make([]byte, buffer.MaxMessageSize), nil)
Related to https://github.com/tailscale/wireguard-go/pull/49/changes#r2955084809
This would previously have been sized to 2016 bytes. Other platforms vary as well, e.g., Android and iOS.
I ran some local, single TCP stream benchmarks w/iperf3 on Ubuntu 24.04 w/Intel i5-12400 CPUs. 4184faf (what's currently used by tailscale/tailscale): RSS is much improved w/lots of streams (-P 32): 1GB+ vs ~200MB
-	messageBuffers   *WaitPool
-	inboundElements  *WaitPool
-	outboundElements *WaitPool
+	messageBuffers buffer.Source
Reminder: we will need to support a limit (e.g., the PreallocatedBuffersPerPool constant today) on the outstanding buffers, or on the outstanding aggregate buffer size, for mobile platforms, which WaitPool previously enforced with its max field set to a nonzero value.
device/send.go
-	err = peer.SendBuffers([][]byte{buf})
+	err = peer.SendBuffers([][]byte{buf.Data()})
+	buf.Release()
Can we defer release immediately following instantiation? Same with all the other handshake and initiation message sending methods.
tun/tun_linux.go
}
file := os.NewFile(uintptr(fd), "/dev/tun")
bufPool := buffer.NoErrSource{Source: buffer.NewDefaultFragmentPool()}
arenaBuffer, _ := bufPool.Get(32 * 16 << 10)
Is this sizing arbitrary?
Can we document the decimal value in human-readable form?
type setGSOFunc func(control *[]byte, gsoSize uint16)

func coalesceMessages(addr *net.UDPAddr, ep *StdNetEndpoint, bufs [][]byte, offset int, msgs []ipv6.Message, setGSO setGSOFunc) int {
I would love to see the scatter-gather changes land in tailscale/tailscale (net/batching pkg) before larger tailscale/wireguard-go changes.
My understanding is that:
- they are a prerequisite for the larger packet memory architecture changes in this PR
- they provide a modest performance improvement by collapsing memcpy/linearization into one place, kernel-side
- they are widely supported: if sendmmsg() is supported, scatter-gather is supported through it
- they also benefit peer relay, which uses net/batching
While I don't anticipate any issues with the approach, it would be better (and easier) to identify kernel bugs or whatever early and detached from the larger changes here.
On Linux NativeTun will write a vector of coalesced buffers via writev(2) instead of copying fragments into a single buffer.

Updates tailscale/corp#36989

Signed-off-by: Alex Valiushko <alexvaliushko@tailscale.com>
Change-Id: Id1e9cd3cc9435c240b7a28ae6528cd7e6a6a6964
Signed-off-by: Alex Valiushko <alexvaliushko@tailscale.com>
Change-Id: I58908d9d3fd09441e9378a74b0ee19136a6a6964
- coalesceMessages uses scatter-gather instead of copying into a contiguous slice
- Arena for temporary scratch buffers

Reduces resident memory by an order of magnitude & improves (best case, iperf) throughput by about 10%.
Updates tailscale/corp#36989
Change-Id: I1a72cd54d4be2b3b0c7568982e1a700c6a6a6964