Fix key ratchet race condition during DAVE epoch transitions#5
Fix key ratchet race condition during DAVE epoch transitions#5MohmmedAshraf wants to merge 2 commits intodisgoorg:masterfrom
Conversation
When a user joins or leaves a voice channel, Discord triggers a DAVE_PROTOCOL_PREPARE_EPOCH event. The prepareEpoch() function calls Session.Init() which destroys the old MLS crypto state. Discord then sends DavePrepareTransition which calls setupKeyRatchetForUser, but at this point GetKeyRatchet() returns a Go struct wrapping a NULL C pointer because the new MLS handshake hasn't completed yet. This causes SetPassthroughMode(false) to be called on the encryptor which expects encryption keys that don't exist yet, resulting in every Encrypt() call failing with ErrMissingKeyRatchet until the new MLS Welcome message arrives (~1 second later). This fix addresses the race in two places: 1. prepareEpoch: Reset encryptor and all decryptors to passthrough mode BEFORE calling Session.Init(), so audio can continue flowing unencrypted during the brief MLS renegotiation window rather than failing entirely. 2. setupKeyRatchetForUser: Guard against nil key ratchets returned by GetKeyRatchet(). If encryption is enabled but no valid key ratchet exists yet (because the MLS handshake is still in progress), skip the transition and keep passthrough mode active. The next MLS Welcome/Commit will call prepareTransition again with valid keys. 3. newKeyRatchet: Return nil when the underlying C handle is nil, matching the existing pattern used by newWelcomeResult. This prevents wrapping NULL C pointers in Go structs that appear non-nil to callers.
|
@davfsa can you review this? |
davfsa
left a comment
There was a problem hiding this comment.
I am a bit iffy about this, as the PR seems 100% AI, but I do know that what you mention is a bug, as I have ran into it.
This implementation of DAVE follows Discord's own JS implementation, which I believe also has this behaviour, so I am not sure if its intentional.
Regardless, the fixes you made are not relevant, and the only reason that you might see improvements and no decoding errors is because you set the passthrough mode to true, so decoding errors are just silenced
| kr := s.session.GetKeyRatchet(string(userID)) | ||
| if !disabled && kr == nil { | ||
| return | ||
| } |
There was a problem hiding this comment.
This will never happen. If it did, we would be getting crashes everywhere due to newKeyRatchet.
There was a problem hiding this comment.
that's literally how I found the bug lol. my bot was crashing nonstop with ErrMissingKeyRatchet every time someone joined/left a voice channel until I traced it back to this.
There was a problem hiding this comment.
that's literally how I found the bug lol. my bot was crashing nonstop with ErrMissingKeyRatchet every time someone joined/left a voice channel until I traced it back to this.
That isn't a crash, thats just a warning/error raised by the underlying library.
I will take a look when I find some time, but thank you for raising this issue. I am bit skeptical on changing the current behaviour because this is a very finiky system and I don't want to break it when DAVE fully rolls out and it breaks because we are always sending unencrypted packets / being unable to decrypt any packages correctly
There was a problem hiding this comment.
@davfsa honestly I'm glad you're pushing back, I'm not a Go developer so having someone who actually knows the internals review this properly is exactly what I need. switching from discordgo to disgo was the right move.
no rush, appreciate it 👍
| kr := s.session.GetKeyRatchet(string(userID)) | ||
| if !disabled && kr == nil { | ||
| return | ||
| } |
… ratchet - Remove SetPassthroughMode(true) in prepareEpoch — not how the reference impl handles it - Add warn logs when GetKeyRatchet returns nil after Init() resets MLS state - Keep nil guards in setupKeyRatchetForUser (the actual fix for the race) - Keep nil guard in newKeyRatchet (defensive, matches newWelcomeResult pattern) Logs will prove the prepare_epoch → prepare_transition(0) sequence triggers nil ratchets in production.
|
Hi @davfsa, yeah you're right on the passthrough in about the AI thing... yeah I use AI to help me write Go, I'm mainly a PHP developer. I have a production discord bot that needs DAVE working before March 1 and I was debugging this for hours before I got to this fix. regarding nil guards are not irrelevant. I just tested locally with diagnostic logs and here's what happens every time someone joins or leaves the voice channel: this hits every single time. the sequence:
the nil guard just skips touching the encryptor/decryptor when there are no keys yet. old ratchet stays in place, audio keeps working with previous epoch keys (which receivers retain for up to 10s per the spec), then once the MLS handshake completes everything switches to the new ratchet normally. also for what it's worth my integration is literally appreciate you taking the time to review this btw 🙏 |
|
@MohmmedAshraf can you still reproduce these issues? |
|
@RSMCx1 do you have any suggestions on how to reproduce your issue? |
|
Hi @topi314 sure i can help do you have discord or smth so i can share more details? mine is straidar |
|
just join the disgo discord linked in the readme :) |
Problem
When a user joins or leaves a DAVE-enabled voice channel, Discord triggers a
DAVE_PROTOCOL_PREPARE_EPOCHevent. The current code flow creates a race condition:prepareEpoch()callsSession.Init()which destroys the old MLS crypto stateDavePrepareTransition→setupKeyRatchetForUser(self)GetKeyRatchet()returns a Go struct wrapping a NULL C pointer (old state destroyed, new handshake not done yet)SetPassthroughMode(false)is called → encryptor expects encryption but has no keysEncrypt()call fails withErrMissingKeyRatchetfor ~1 second until the new MLS Welcome arrivesThis means every member join/leave in a DAVE-enabled channel causes ~1 second of complete audio failure for all bots using golibdave.
Root Cause
Two issues:
prepareEpochdestroys crypto state without a safety net.Session.Init()invalidates all key ratchets, but the encryptor/decryptors are still set topassthrough=false— so they try to encrypt/decrypt with destroyed keys.setupKeyRatchetForUserunconditionally disables passthrough. Even whenGetKeyRatchet()returns nil (NULL C handle wrapped in a non-nil Go struct), the code setsSetPassthroughMode(false)and proceeds. ThenewKeyRatchetfunction doesn't check for nil C handles, unlike the existingnewWelcomeResultwhich already does.Fix
Three targeted changes:
1.
prepareEpoch— reset to passthrough before destroying crypto state2.
setupKeyRatchetForUser— guard against nil key ratchetsIf encryption is enabled but
GetKeyRatchet()returns nil (MLS handshake still in progress), skip the transition and keep passthrough active. The next MLS Welcome/Commit will callprepareTransitionagain with valid keys.3.
newKeyRatchet— return nil for nil C handlesMatches the existing pattern used by
newWelcomeResult. Prevents wrapping NULL C pointers in Go structs that appear non-nil to callers.Testing
Tested in production with a Discord voice relay bot (listener + 3 speaker bots across multiple guilds):
pkt_dropped=0)ErrMissingKeyRatcheterrorsContext
DAVE becomes mandatory on March 1, 2026. This race condition affects any bot using golibdave in channels where members join/leave while audio is active — which is essentially all real-world usage.