
Implementation Details of Lookahead Audio Encoder (Rope-Wav2vec2) #3

Description

@halsstarrig243

Thank you for the excellent work on DyStream! I have a question regarding the lookahead audio encoder mentioned in Section 4.6 of the paper.

In the paper, you describe a Rope-Wav2vec2 encoder with 60ms lookahead that achieves a Sync-C score of 7.119 for streaming generation. The paper mentions:

"2. Train a causal encoder with distillation. We modified our Wav2Vec2 encoder to make it causal, noted as Rope-Wav2vec2, and first train a the full-attention version, then distill a causal version."

"3. Train a causal encoder with lookahead. We operate on the attention mask to allow Rope-Wav2vec2 could use a few future audio features."

Table 3 shows impressive results with different lookahead times:

Lookahead Time, Train (ms)    Lookahead Time, Inference (ms)    Sync-C
0                             0                                 3.017
20                            20                                5.976
40                            40                                6.896
60                            60                                7.119

Questions

After examining the released code, I noticed that the current implementation uses WrapedWav2Vec (lines 229-256 in motion_gen_gpt_flowmatching_addaudio_linear_twowavencoder.py), which wraps the standard facebook/wav2vec2-base-960h model. The make_attention_causal function patches the attention layers, but it currently passes is_causal=False to F.scaled_dot_product_attention, so as far as I can tell the attention is still bidirectional.

I have several questions about the lookahead audio encoder implementation:

1. Attention Mask Design

How is the lookahead attention mask implemented in your Rope-Wav2vec2?

  • Is it applied to all encoder layers or only specific layers?
  • For a 60 ms lookahead (3 frames, given wav2vec2's 20 ms output stride, i.e. roughly 50 frames per second), how is the attention mask structured for each token?
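
To make the question concrete, here is a minimal sketch of the mask I would expect, not something taken from the DyStream code. It assumes one frame per 20 ms and that each query frame may attend to all past frames plus a fixed number of future frames; the names lookahead_frames and lookahead_mask are hypothetical.

```python
FRAME_MS = 20  # assumption: wav2vec2 emits one frame every 20 ms (about 50 Hz)


def lookahead_frames(lookahead_ms: int, frame_ms: int = FRAME_MS) -> int:
    """Number of whole future frames covered by a lookahead given in ms."""
    return lookahead_ms // frame_ms


def lookahead_mask(seq_len: int, lookahead: int) -> list:
    """Build a seq_len x seq_len boolean mask.

    mask[i][j] is True when query frame i may attend to key frame j,
    i.e. when j is in the past, the present, or at most `lookahead`
    frames in the future. lookahead=0 gives a plain causal mask.
    """
    return [[j <= i + lookahead for j in range(seq_len)] for i in range(seq_len)]


# 60 ms lookahead -> 3 future frames visible to every query frame.
mask = lookahead_mask(seq_len=6, lookahead=lookahead_frames(60))
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xxxx..
# xxxxx.
# xxxxxx
# xxxxxx
# xxxxxx
# xxxxxx
```

If this matches your design, I assume the mask would be converted to a boolean tensor and passed as attn_mask to F.scaled_dot_product_attention (where True means "may attend") in place of the is_causal flag. Please correct me if the actual structure differs, for example if the lookahead is applied only in some layers.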

2. Distillation Process

Could you share more details about the distillation training process?

  • What is the distillation loss function (e.g., MSE on hidden states, KL divergence on output distributions)?
  • Is the distillation performed layer-by-layer or only on the final output?
  • What are the training hyperparameters (learning rate, epochs, dataset)?

3. RoPE Integration

The paper mentions "Rope-Wav2vec2" - does this mean you replaced the positional encoding with Rotary Position Embedding (RoPE)?

  • If so, how does RoPE interact with the lookahead attention mask?
  • Is RoPE only applied to the audio encoder or also to the motion generation model?
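
For reference, this is my understanding of standard RoPE, written as a small dependency-free sketch rather than anything from your code. The function names, the base of 10000, and the toy vectors are my own assumptions. It illustrates why I would expect RoPE and the lookahead mask to be independent: the attention score between a rotated query and key depends only on their relative offset, so the mask alone would decide which offsets are visible.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair (vec[i], vec[i+1]) by pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query/key vectors (arbitrary values, even dimension required).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (query is 2 frames after the key) at two absolute positions.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 105), rope(k, 103))
print(abs(score_near - score_far) < 1e-9)  # True: only the offset matters
```

If Rope-Wav2vec2 instead keeps wav2vec2's convolutional positional embedding alongside RoPE, or uses a different rotation scheme, that detail would be helpful to know.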

4. Code Release Plan

Do you have plans to release the following components?

  • Rope-Wav2vec2 model implementation with lookahead attention
  • Distillation training script and configuration
  • Pretrained lookahead audio encoder weights

Current Code Analysis

In the released code:

# motion_gen_gpt_flowmatching_addaudio_linear_twowavencoder.py

def make_attention_causal(attn: Wav2Vec2Attention):
    # ...
    y = F.scaled_dot_product_attention(
        q, k, v,
        dropout_p=p if self.training else 0.0,
        is_causal=False,  # Currently set to False
    )
    # ...

class WrapedWav2Vec(nn.Module):
    def __init__(self, layers: int = 1):
        base = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        # ...
        for l in self.encoder.layers:
            make_attention_causal(l.attention)

It appears the current implementation doesn't include the lookahead mechanism described in the paper. Would it be possible to update the repository with the streaming-ready version?

Thank You

I appreciate your time in addressing these questions. This work is very valuable for the talking head generation community, and understanding the streaming audio encoder implementation would help others build upon your work.
