Fix key ratchet race condition during DAVE epoch transitions #5

Open
MohmmedAshraf wants to merge 2 commits into disgoorg:master from MohmmedAshraf:fix/dave-epoch-key-ratchet-race

Conversation

@MohmmedAshraf

Problem

When a user joins or leaves a DAVE-enabled voice channel, Discord triggers a DAVE_PROTOCOL_PREPARE_EPOCH event. The current code flow creates a race condition:

  1. prepareEpoch() calls Session.Init() which destroys the old MLS crypto state
  2. Discord then sends DavePrepareTransition → setupKeyRatchetForUser(self)
  3. GetKeyRatchet() returns a Go struct wrapping a NULL C pointer (old state destroyed, new handshake not done yet)
  4. SetPassthroughMode(false) is called → encryptor expects encryption but has no keys
  5. Every Encrypt() call fails with ErrMissingKeyRatchet for ~1 second until the new MLS Welcome arrives

This means every member join/leave in a DAVE-enabled channel causes ~1 second of complete audio failure for all bots using golibdave.

Root Cause

Two issues:

  1. prepareEpoch destroys crypto state without a safety net. Session.Init() invalidates all key ratchets, but the encryptor/decryptors are still set to passthrough=false — so they try to encrypt/decrypt with destroyed keys.

  2. setupKeyRatchetForUser unconditionally disables passthrough. Even when GetKeyRatchet() returns nil (NULL C handle wrapped in a non-nil Go struct), the code sets SetPassthroughMode(false) and proceeds. The newKeyRatchet function doesn't check for nil C handles, unlike the existing newWelcomeResult which already does.

Fix

Three targeted changes:

1. prepareEpoch — reset to passthrough before destroying crypto state

func (s *session) prepareEpoch(epoch int, protocolVersion uint16) {
    if epoch != mlsNewGroupExpectedEpoch {
        return
    }

    // NEW: Safe fallback — passthrough while MLS renegotiates
    s.encryptor.SetPassthroughMode(true)
    for _, dec := range s.decryptors {
        dec.TransitionToPassthroughMode(true)
    }

    s.session.Init(protocolVersion, uint64(s.channelID), string(s.selfUserID))
}

2. setupKeyRatchetForUser — guard against nil key ratchets

If encryption is enabled but GetKeyRatchet() returns nil (MLS handshake still in progress), skip the transition and keep passthrough active. The next MLS Welcome/Commit will call prepareTransition again with valid keys.
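
A minimal sketch of the guard described above, for illustration only. The types fakeSession, fakeEncryptor and keyRatchet and the helper setupSelfKeyRatchet are hypothetical stand-ins, not golibdave's actual types; only the shape of the `!disabled && kr == nil` check is taken from the diff under review.

```go
package main

import "fmt"

// keyRatchet is a hypothetical stand-in for the library's key ratchet wrapper.
type keyRatchet struct{ epoch int }

// fakeSession models only the lookup used here (assumption, not the real session).
type fakeSession struct{ ratchets map[string]*keyRatchet }

// GetKeyRatchet returns nil when no ratchet exists for the user yet.
func (s *fakeSession) GetKeyRatchet(userID string) *keyRatchet {
	return s.ratchets[userID]
}

// fakeEncryptor records the two pieces of state the guard protects.
type fakeEncryptor struct {
	passthrough bool
	ratchet     *keyRatchet
}

func (e *fakeEncryptor) SetPassthroughMode(on bool)   { e.passthrough = on }
func (e *fakeEncryptor) SetKeyRatchet(kr *keyRatchet) { e.ratchet = kr }

// setupSelfKeyRatchet reports whether the encryptor was updated.
// With encryption enabled and no ratchet available, it leaves the
// encryptor untouched instead of switching it to a keyless state.
func setupSelfKeyRatchet(s *fakeSession, enc *fakeEncryptor, selfID string, disabled bool) bool {
	kr := s.GetKeyRatchet(selfID)
	if !disabled && kr == nil {
		return false // handshake not finished; try again on the next transition
	}
	enc.SetPassthroughMode(disabled)
	if !disabled {
		enc.SetKeyRatchet(kr)
	}
	return true
}

func main() {
	s := &fakeSession{ratchets: map[string]*keyRatchet{}}
	enc := &fakeEncryptor{passthrough: true}

	// No ratchet yet: the guard skips the update.
	fmt.Println(setupSelfKeyRatchet(s, enc, "self", false), enc.passthrough)

	// Ratchet available after the handshake: the update goes through.
	s.ratchets["self"] = &keyRatchet{epoch: 2}
	fmt.Println(setupSelfKeyRatchet(s, enc, "self", false), enc.passthrough)
}
```

Returning a bool here is a choice made for the sketch so the skipped case is observable; the function in the PR simply returns.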

3. newKeyRatchet — return nil for nil C handles

Matches the existing pattern used by newWelcomeResult. Prevents wrapping NULL C pointers in Go structs that appear non-nil to callers.
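
To illustrate why the constructor check matters, here is a small sketch under stated assumptions: handle is a hypothetical uintptr stand-in for the cgo handle (0 plays the role of NULL), and both constructors are simplified models rather than the library's code.

```go
package main

import "fmt"

// handle is a hypothetical stand-in for an opaque C pointer; 0 means NULL.
type handle uintptr

// KeyRatchet models a Go wrapper around a native handle.
type KeyRatchet struct {
	h handle
}

// newKeyRatchetUnchecked always wraps, so a NULL handle still yields
// a non-nil *KeyRatchet and a caller's `kr == nil` check never fires.
func newKeyRatchetUnchecked(h handle) *KeyRatchet {
	return &KeyRatchet{h: h}
}

// newKeyRatchet returns nil for a NULL handle, so callers can detect
// the "no ratchet yet" case with an ordinary nil comparison.
func newKeyRatchet(h handle) *KeyRatchet {
	if h == 0 {
		return nil
	}
	return &KeyRatchet{h: h}
}

func main() {
	fmt.Println(newKeyRatchetUnchecked(0) == nil) // false: NULL is hidden inside the struct
	fmt.Println(newKeyRatchet(0) == nil)          // true: NULL surfaces as nil
	fmt.Println(newKeyRatchet(42) == nil)         // false: a live handle is wrapped
}
```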

Testing

Tested in production with a Discord voice relay bot (listener + 3 speaker bots across multiple guilds):

  • Continuous audio relay with zero packet drops (pkt_dropped=0)
  • Multiple epoch transitions triggered by member joins/leaves — all completed successfully
  • Key ratchets correctly established after each MLS handshake
  • No ErrMissingKeyRatchet errors

Context

DAVE becomes mandatory on March 1, 2026. This race condition affects any bot using golibdave in channels where members join/leave while audio is active — which is essentially all real-world usage.

When a user joins or leaves a voice channel, Discord triggers a
DAVE_PROTOCOL_PREPARE_EPOCH event. The prepareEpoch() function calls
Session.Init() which destroys the old MLS crypto state. Discord then
sends DavePrepareTransition which calls setupKeyRatchetForUser, but
at this point GetKeyRatchet() returns a Go struct wrapping a NULL C
pointer because the new MLS handshake hasn't completed yet.

This causes SetPassthroughMode(false) to be called on the encryptor
which expects encryption keys that don't exist yet, resulting in every
Encrypt() call failing with ErrMissingKeyRatchet until the new MLS
Welcome message arrives (~1 second later).

This fix addresses the race in three places:

1. prepareEpoch: Reset encryptor and all decryptors to passthrough
   mode BEFORE calling Session.Init(), so audio can continue flowing
   unencrypted during the brief MLS renegotiation window rather than
   failing entirely.

2. setupKeyRatchetForUser: Guard against nil key ratchets returned by
   GetKeyRatchet(). If encryption is enabled but no valid key ratchet
   exists yet (because the MLS handshake is still in progress), skip
   the transition and keep passthrough mode active. The next MLS
   Welcome/Commit will call prepareTransition again with valid keys.

3. newKeyRatchet: Return nil when the underlying C handle is nil,
   matching the existing pattern used by newWelcomeResult. This
   prevents wrapping NULL C pointers in Go structs that appear
   non-nil to callers.
@topi314
Member

topi314 commented Feb 23, 2026

@davfsa can you review this?

Contributor

@davfsa davfsa left a comment

I am a bit iffy about this, as the PR seems 100% AI, but I do know that what you mention is a bug, as I have run into it.

This implementation of DAVE follows Discord's own JS implementation, which I believe also has this behaviour, so I am not sure if it's intentional.

Regardless, the fixes you made are not relevant, and the only reason you might see improvements and no decoding errors is that you set the passthrough mode to true, so decoding errors are just silenced.

Comment thread golibdave/golibdave.go Outdated
Comment thread golibdave/golibdave.go
Comment on lines +236 to +239
kr := s.session.GetKeyRatchet(string(userID))
if !disabled && kr == nil {
    return
}
Contributor

This will never happen. If it did, we would be getting crashes everywhere due to newKeyRatchet.

Author

that's literally how I found the bug lol. my bot was crashing nonstop with ErrMissingKeyRatchet every time someone joined/left a voice channel until I traced it back to this.

Contributor

> that's literally how I found the bug lol. my bot was crashing nonstop with ErrMissingKeyRatchet every time someone joined/left a voice channel until I traced it back to this.

That isn't a crash, that's just a warning/error raised by the underlying library.

I will take a look when I find some time, but thank you for raising this issue. I am a bit skeptical about changing the current behaviour because this is a very finicky system, and I don't want it to break when DAVE fully rolls out because we are always sending unencrypted packets or are unable to decrypt any packets correctly.

Author

@davfsa honestly I'm glad you're pushing back, I'm not a Go developer so having someone who actually knows the internals review this properly is exactly what I need. switching from discordgo to disgo was the right move.

no rush, appreciate it 👍

Comment thread golibdave/golibdave.go
Comment on lines +248 to +251
kr := s.session.GetKeyRatchet(string(userID))
if !disabled && kr == nil {
    return
}
Contributor

Ditto the above

… ratchet

- Remove SetPassthroughMode(true) in prepareEpoch — not how the reference impl handles it
- Add warn logs when GetKeyRatchet returns nil after Init() resets MLS state
- Keep nil guards in setupKeyRatchetForUser (the actual fix for the race)
- Keep nil guard in newKeyRatchet (defensive, matches newWelcomeResult pattern)

Logs will prove the prepare_epoch → prepare_transition(0) sequence triggers
nil ratchets in production.
@MohmmedAshraf
Author

Hi @davfsa, yeah you're right on the passthrough in prepareEpoch, removed it.. pushed the update, logs added so you can reproduce.

about the AI thing... yeah I use AI to help me write Go, I'm mainly a PHP developer. I have a production discord bot that needs DAVE working before March 1 and I was debugging this for hours before I got to this fix.

regarding the nil guards: they are not irrelevant. I just tested locally with diagnostic logs and here's what happens every time someone joins or leaves the voice channel:

15:17:05 WARN prepareEpoch: resetting MLS session via Init epoch=1 protocol_version=1
15:17:05 WARN nil key ratchet for self after GetKeyRatchet user_id=1437901958676 protocol_version=1
15:17:11 WARN prepareEpoch: resetting MLS session via Init epoch=1 protocol_version=1
15:17:11 WARN nil key ratchet for self after GetKeyRatchet user_id=143790230640 protocol_version=1
15:17:53 WARN prepareEpoch: resetting MLS session via Init epoch=1 protocol_version=1
15:17:53 WARN nil key ratchet for self after GetKeyRatchet user_id=1437900140987 protocol_version=1
15:17:56 WARN nil key ratchet for user after GetKeyRatchet user_id=353890032735 protocol_version=1

this hits every single time. the sequence:

  1. gateway sends prepare_epoch(1) → Init() resets MLS state
  2. gateway sends prepare_transition(0, 1) right after
  3. setupKeyRatchetForUser runs, GetKeyRatchet() returns nil because no commit/welcome processed yet
  4. without the nil guard → SetPassthroughMode(false) + SetKeyRatchet(nil) → every Encrypt() fails with ErrMissingKeyRatchet until the handshake finishes ~1s later

the nil guard just skips touching the encryptor/decryptor when there are no keys yet. old ratchet stays in place, audio keeps working with previous epoch keys (which receivers retain for up to 10s per the spec), then once the MLS handshake completes everything switches to the new ratchet normally.

also for what it's worth my integration is literally voice.WithDaveSessionCreateFunc(golibdave.NewSession) same as the disgo voice example. nothing custom going on.

appreciate you taking the time to review this btw 🙏

@topi314
Member

topi314 commented Mar 8, 2026

@MohmmedAshraf can you still reproduce these issues?
I am currently unable to reproduce them at all.
Happy to live test/debug with you

@topi314
Member

topi314 commented Mar 8, 2026

@RSMCx1 do you have any suggestions on how to reproduce your issue?

@MohmmedAshraf
Author

Hi @topi314 sure i can help do you have discord or smth so i can share more details? mine is straidar

@topi314
Member

topi314 commented Mar 8, 2026

just join the disgo discord linked in the readme :)
