fakeip: persist metadata on every save interval, not just on Close #4140

Open
arthur109 wants to merge 93 commits into SagerNet:testing from arthur109:fakeip-metadata-async-save

Conversation

@arthur109

Summary

(*CacheFile).FakeIPSaveMetadataAsync has two bugs that compound so that the on-disk fakeip allocation counter advances only when Close() runs. On mobile, where the Android BoxService is often killed by OOM, am force-stop, or a phone reboot before a clean teardown, the counter on disk falls behind reality. The next start loads the stale counter, and the allocator in Store.Create() silently overwrites existing reverse-map entries because it doesn't check for collisions before storing.

The two bugs

// experimental/cachefile/fakeip.go (before)
func (c *CacheFile) FakeIPSaveMetadataAsync(metadata *adapter.FakeIPMetadata) {
    if c.saveMetadataTimer == nil {
        c.saveMetadataTimer = time.AfterFunc(C.FakeIPMetadataSaveInterval, func() {
            _ = c.FakeIPSaveMetadata(metadata)   // captures FIRST metadata
        })
    } else {
        c.saveMetadataTimer.Reset(C.FakeIPMetadataSaveInterval)   // never updates the closure
    }
}

Bug A: the timer never fires under load. Create() calls this on every allocation. Every call after the first goes through .Reset(), pushing the deadline back another 10 s. Any active workload (continuously resolving new domains) keeps postponing the deadline, so the callback never runs.

Bug B: even if it fires, it persists the wrong data. The first call's metadata pointer is captured by the closure. Subsequent calls construct fresh FakeIPMetadata{...} values and pass them in, but they only reset the timer; the closure is never replaced. A delayed fire would write the very first snapshot of the session.

Together, the only code path that ever writes correct metadata is Store.Close() → CacheFile.FakeIPSaveMetadata (synchronous, with current values).
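
To make the two bugs concrete, here is a minimal standalone sketch of the same pattern. It is not the sing-box code: debouncedSaver, saveAsync and runDebounceDemo are hypothetical names, an int stands in for the metadata, and the interval is shortened to 300 ms so the demo finishes quickly.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// debouncedSaver mimics the pre-patch shape (hypothetical type): the closure
// is created once, and every later call only resets the timer.
type debouncedSaver struct {
	mu       sync.Mutex
	interval time.Duration
	timer    *time.Timer
	saved    []int // values that reached the pretend "disk"
}

func (d *debouncedSaver) saveAsync(value int) {
	if d.timer == nil {
		d.timer = time.AfterFunc(d.interval, func() {
			d.mu.Lock()
			d.saved = append(d.saved, value) // Bug B: only the FIRST value is captured
			d.mu.Unlock()
		})
	} else {
		d.timer.Reset(d.interval) // Bug A: the deadline moves back on every call
	}
}

func (d *debouncedSaver) snapshot() []int {
	d.mu.Lock()
	defer d.mu.Unlock()
	return append([]int(nil), d.saved...)
}

// runDebounceDemo calls saveAsync every 20 ms for about 400 ms (longer than
// the interval), then goes idle. It returns what had been saved at the end of
// the busy phase and what had been saved after the idle phase.
func runDebounceDemo() (duringLoad, afterIdle []int) {
	d := &debouncedSaver{interval: 300 * time.Millisecond}
	for i := 1; i <= 20; i++ {
		d.saveAsync(i)
		time.Sleep(20 * time.Millisecond)
	}
	duringLoad = d.snapshot()
	time.Sleep(900 * time.Millisecond)
	afterIdle = d.snapshot()
	return duringLoad, afterIdle
}

func main() {
	during, after := runDebounceDemo()
	fmt.Println("saved during load:", during)
	fmt.Println("saved after idle: ", after)
}
```

Nothing is saved while calls keep arriving, even though the busy phase outlasts the interval. Once the calls stop, the single save that does happen carries the value from the first call rather than the twentieth.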

Why it matters

Store.Create() does not check whether the proposed next IP is already in fakeip_address:

nextAddress := s.inet4Current.Next()
// ... range / wrap check ...
s.inet4Current = nextAddress
err := s.storage.FakeIPStore(address, domain)   // overwrites silently

So when Start() restores inet4Current from stale metadata, the first ~N allocations of the new session overwrite the reverse-map entries for IPs [stale_counter+1, actual_bucket_max]. The forward map (fakeip_domain*) still points to those IPs for the old domains, so any app that cached the prior DNS answer keeps connecting to fake IPs whose reverse-map now resolves to a different domain. The router dials the wrong outbound → TLS handshake fails with a cert mismatch → the affected hosts break, while everything else (newly-allocated or unaffected) keeps working.

Reproduced this in production on Android with Instagram: the file's fakeip_metadata had Inet4Current = 198.18.0.6 while fakeip_address held entries up to 198.18.0.40. After restart, ~34 allocations clobbered existing entries before the counter caught up. Profile pictures and chats kept working (those endpoints didn't get clobbered); reels, stories, posts, and profile pages failed (the reverse-map entries for their endpoints had been rewritten).
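
The overwrite count can be checked with a small simulation. This is a sketch under assumptions, not the real Store: allocator, create and countClobbered are hypothetical names, the reverse map is a plain Go map, and the previous session is assumed to have filled every address from 198.18.0.2 up to the bucket maximum.

```go
package main

import (
	"fmt"
	"net/netip"
)

// allocator is a hypothetical stand-in for the fakeip store: a counter plus a
// reverse map (address -> domain) that is written without a collision check.
type allocator struct {
	current netip.Addr
	reverse map[netip.Addr]string
}

// create hands out the next address and stores it unconditionally.
// It reports whether an existing reverse-map entry was replaced.
func (a *allocator) create(domain string) (netip.Addr, bool) {
	next := a.current.Next()
	_, clobbered := a.reverse[next]
	a.current = next
	a.reverse[next] = domain
	return next, clobbered
}

// countClobbered restores the counter from staleCounter while the reverse map
// already holds entries up to bucketMax, then allocates until the first
// unused address is reached. It returns how many old entries were replaced.
func countClobbered(staleCounter, bucketMax string) int {
	top := netip.MustParseAddr(bucketMax)
	a := &allocator{
		current: netip.MustParseAddr(staleCounter),
		reverse: map[netip.Addr]string{},
	}
	// Entries left behind by the previous session (assumed contiguous).
	for addr := netip.MustParseAddr("198.18.0.2"); addr.Compare(top) <= 0; addr = addr.Next() {
		a.reverse[addr] = "old-" + addr.String()
	}
	clobbered := 0
	for i := 0; ; i++ {
		if _, hit := a.create(fmt.Sprintf("new-%d.example", i)); !hit {
			break
		}
		clobbered++
	}
	return clobbered
}

func main() {
	fmt.Println(countClobbered("198.18.0.6", "198.18.0.40"))  // 34
	fmt.Println(countClobbered("198.18.0.12", "198.18.0.12")) // 0
}
```

With the counter at .6 and entries up to .40, addresses .7 through .40 are reassigned, which matches the ~34 overwrites observed. When the counter equals the bucket maximum, the first allocation lands on an unused address.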

Fix

Track the latest metadata in a mutex-protected field on CacheFile, let the timer fire on its own schedule (no Reset), and on fire snapshot the latest pointer and clear the timer so the next allocation reschedules.

// after
func (c *CacheFile) FakeIPSaveMetadataAsync(metadata *adapter.FakeIPMetadata) {
    c.saveMetadataAccess.Lock()
    c.latestFakeIPMetadata = metadata
    if c.saveMetadataTimer == nil {
        c.saveMetadataTimer = time.AfterFunc(C.FakeIPMetadataSaveInterval, func() {
            c.saveMetadataAccess.Lock()
            m := c.latestFakeIPMetadata
            c.saveMetadataTimer = nil
            c.saveMetadataAccess.Unlock()
            if m != nil {
                _ = c.FakeIPSaveMetadata(m)
            }
        })
    }
    c.saveMetadataAccess.Unlock()
}

Two new fields on CacheFile: saveMetadataAccess sync.Mutex and latestFakeIPMetadata *adapter.FakeIPMetadata. Total: 12 added lines, 3 removed.
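
The same scheduling pattern can be exercised outside sing-box. The sketch below is an illustration only: intervalSaver, saveAsync and runIntervalDemo are hypothetical names, the save callback stands in for FakeIPSaveMetadata, an int stands in for the metadata, and the interval is shortened to 300 ms.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// intervalSaver is a hypothetical miniature of the patched logic: remember
// the latest value, schedule at most one timer, never push its deadline back.
type intervalSaver struct {
	mu       sync.Mutex
	interval time.Duration
	timer    *time.Timer
	latest   *int
	save     func(int) // stand-in for the synchronous save
}

func (s *intervalSaver) saveAsync(value int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.latest = &value
	if s.timer != nil {
		return // a save is already scheduled; leave its deadline alone
	}
	s.timer = time.AfterFunc(s.interval, func() {
		s.mu.Lock()
		v := s.latest
		s.timer = nil // the next saveAsync call schedules a fresh interval
		s.mu.Unlock()
		if v != nil {
			s.save(*v) // runs outside the lock, like the patched code
		}
	})
}

// runIntervalDemo makes a burst of five calls, waits for the save, then makes
// one more call and waits again. It returns the saved values in order.
func runIntervalDemo() []int {
	var (
		mu    sync.Mutex
		saved []int
	)
	s := &intervalSaver{
		interval: 300 * time.Millisecond,
		save: func(v int) {
			mu.Lock()
			saved = append(saved, v)
			mu.Unlock()
		},
	}
	for i := 1; i <= 5; i++ {
		s.saveAsync(i)
		time.Sleep(10 * time.Millisecond)
	}
	time.Sleep(700 * time.Millisecond)
	s.saveAsync(6)
	time.Sleep(700 * time.Millisecond)
	mu.Lock()
	defer mu.Unlock()
	return append([]int(nil), saved...)
}

func main() {
	fmt.Println(runIntervalDemo())
}
```

The burst of five calls results in one save carrying the last value of the burst, and the later call schedules its own save. Running the save outside the lock keeps slow disk writes from blocking callers of saveAsync.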

Behaviour after the patch

  • All allocations within a FakeIPMetadataSaveInterval window update the tracked metadata, and the save writes the latest of them. The timer performs at most one save per interval; the next allocation after a save schedules a fresh interval.
  • Metadata on disk now lags reality by at most FakeIPMetadataSaveInterval (10 s by default) under continuous load, instead of lagging without bound.
  • Close() semantics are unchanged.
  • No new goroutines, no busy-looping, no behavioural regression for users on platforms with reliable clean shutdown.

Verification

Tested end-to-end on a Samsung Android device running an embedded libbox:

| Scenario | Counter on disk | Bucket max | Result |
|---|---|---|---|
| Pre-patch, after am force-stop cycles | 198.18.0.6 | 198.18.0.40 | next start: ~34 silent overwrites → IG breaks |
| Post-patch, idle session | 198.18.0.12 | 198.18.0.12 | matches; next allocation lands at .13 |
| Post-patch, after am force-stop | 198.18.0.12 | 198.18.0.12 | survives unclean shutdown |
| Post-patch, after real phone reboot + new IG session | 198.18.0.33 | 198.18.0.33 | survives reboot; IG loads reels/stories/profiles |

Test plan

  • Compiles (go build ./experimental/cachefile/)
  • Reproduced bug pre-patch on Android (Instagram broke after phone reboot)
  • Post-patch: metadata advances within ~10 s of allocations
  • Post-patch: metadata survives am force-stop
  • Post-patch: metadata survives full phone reboot
  • Post-patch: new allocations after restart land at bucket_max + 1, no collisions
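
The "counter matches bucket max" checks above could be automated along these lines. This is a rough sketch: counterCoversBucket is a hypothetical helper, reading the values out of the cache file is left out, and the comparison assumes the allocator has not yet wrapped around its range (after a wrap the counter can legitimately be lower than the highest stored address).

```go
package main

import (
	"fmt"
	"net/netip"
)

// counterCoversBucket reports whether the persisted counter is at or beyond
// every address found in the reverse map. A false result means the next
// start would reuse addresses that are already allocated.
func counterCoversBucket(counter string, allocated []string) (bool, error) {
	cur, err := netip.ParseAddr(counter)
	if err != nil {
		return false, err
	}
	for _, raw := range allocated {
		addr, err := netip.ParseAddr(raw)
		if err != nil {
			return false, err
		}
		if addr.Compare(cur) > 0 {
			return false, nil // an entry lies past the counter
		}
	}
	return true, nil
}

func main() {
	// Values taken from the pre-patch and post-patch observations.
	pre, _ := counterCoversBucket("198.18.0.6", []string{"198.18.0.2", "198.18.0.40"})
	post, _ := counterCoversBucket("198.18.0.33", []string{"198.18.0.2", "198.18.0.33"})
	fmt.Println(pre, post) // false true
}
```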

🤖 Generated with Claude Code

macronut and others added 22 commits May 2, 2026 23:07
Signed-off-by: macronut <4027187+macronut@users.noreply.github.com>
FakeIPSaveMetadataAsync had two bugs that compound:

1. The 10s timer is debounced: every call invokes .Reset(), pushing the
   deadline another 10s. Under any active workload (every new domain
   triggers an allocation triggers a save call) the timer never fires —
   the only path that ever writes metadata is Close().

2. Even if the timer fired, the closure captures the metadata pointer
   from the FIRST call. Subsequent calls only Reset() the timer; the
   closure is never updated. A delayed fire would persist a stale
   snapshot from the start of the session.

Combined, the on-disk fakeip counter only advances on clean shutdown.
On mobile that almost never happens (process kill, OOM, phone reboot),
so the counter stays at whatever the last clean Close() wrote while
the buckets accumulate well past it.

Because Store.Create() doesn't check whether the next IP is already
allocated — it just calls FakeIPStore which silently overwrites — the
next start loads the stale counter and silently clobbers the reverse-map
entries between (counter, actual_max]. Forward-map (fakeip_domain*)
still points to those IPs for the old domains, so any app that cached
the previous DNS answer (real-world: Instagram, hours of TTL) hits a
fake IP that now reverse-maps to a different domain → router dials the
wrong outbound → TLS cert mismatch → those hosts break while everything
else looks fine.

Fix: track the latest metadata in a mutex-protected field on CacheFile,
let the timer fire on its own schedule (no Reset), and on fire snapshot
the latest pointer and clear the timer so the next allocation
reschedules. Metadata now tracks reality within one
FakeIPMetadataSaveInterval (10s) of any allocation activity.

Verified by reproducing on Android: pre-patch, counter on disk stuck
at .6 while buckets held .2–.40. Post-patch, counter advances within
~10s of new allocations and survives force-kill / restart cycles with
the saved value matching the bucket max, so the next allocation picks
an unused IP instead of overwriting an existing one.
@nekohasekai nekohasekai force-pushed the testing branch 8 times, most recently from abac453 to bf9ea6d Compare May 21, 2026 07:34