Implementation Details of Lookahead Audio Encoder (Rope-Wav2vec2)

Thank you for the excellent work on DyStream! I have a question regarding the **lookahead audio encoder** mentioned in Section 4.6 of the paper.

In the paper, you describe a **Rope-Wav2vec2** encoder with **60ms lookahead** that achieves a Sync-C score of **7.119** for streaming generation. The paper mentions:

> "2. Train a causal encoder with distillation. We modified our Wav2Vec2 encoder to make it causal, noted as Rope-Wav2vec2, and first train a the full-attention version, then distill a causal version."
>
> "3. Train a causal encoder with lookahead. We operate on the attention mask to allow Rope-Wav2vec2 could use a few future audio features."

Table 3 shows impressive results with different lookahead times:

| Lookahead Time, Train (ms) | Lookahead Time, Inference (ms) | Sync-C |
|----------------------------|--------------------------------|--------|
| 0 | 0 | 3.017 |
| 20 | 20 | 5.976 |
| 40 | 40 | 6.896 |
| 60 | 60 | 7.119 |

## Questions

After examining the released code, I noticed that the current implementation uses `WrapedWav2Vec` (lines 229-256 in `motion_gen_gpt_flowmatching_addaudio_linear_twowavencoder.py`), which wraps the standard `facebook/wav2vec2-base-960h` model. The `make_attention_causal` function patches the attention layers, but it currently calls `F.scaled_dot_product_attention` with `is_causal=False` and no attention mask, so attention appears to remain fully bidirectional.

I have several questions about the lookahead audio encoder implementation:

### 1. Attention Mask Design
How is the lookahead attention mask implemented in your Rope-Wav2vec2?
- Is it applied to **all encoder layers** or only **specific layers**?
- For 60ms lookahead (3 frames, given wav2vec2's 20ms output stride, i.e. 50 frames per second), how is the attention mask structured for each token?
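
To make the question concrete, here is a minimal sketch of the mask I would guess at, assuming a simple banded rule where frame `i` may attend to every past frame and up to `k` future frames. The helper name `lookahead_mask` and the constant `FRAME_MS` are my own and are not taken from the paper or the repository.

```python
import torch

FRAME_MS = 20  # wav2vec2 output stride: 320 samples at 16 kHz


def lookahead_mask(num_frames: int, lookahead_frames: int) -> torch.Tensor:
    """Boolean (num_frames, num_frames) mask; True means 'query row may attend to key column'.

    Hypothetical helper: query frame i sees key frame j when j <= i + lookahead_frames.
    """
    idx = torch.arange(num_frames)
    return idx[None, :] <= idx[:, None] + lookahead_frames


# 60 ms of lookahead corresponds to 3 frames at a 20 ms stride.
mask = lookahead_mask(num_frames=6, lookahead_frames=60 // FRAME_MS)
# Row 0 can see frames 0..3; rows 2 and later see every frame in this short example.
print(mask.int())
```

If a mask like this is applied in every encoder layer, the effective lookahead grows with depth (roughly `k` frames per layer), which is why I am asking whether it is applied to all layers or only some.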

### 2. Distillation Process
Could you share more details about the distillation training process?
- What is the distillation loss function (e.g., MSE on hidden states, KL divergence on output distributions)?
- Is the distillation performed layer-by-layer or only on the final output?
- What are the training hyperparameters (learning rate, epochs, dataset)?
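
For reference, this is the kind of setup I have in mind when asking, written as a toy sketch. It assumes an MSE loss on the final hidden states only, a frozen full-attention teacher, and a student initialized from the teacher but restricted by a lookahead mask. `TinyEncoder`, `distillation_loss`, and all hyperparameters here are placeholders of mine, not values from the paper. Note that `nn.MultiheadAttention` uses the convention that `True` in a boolean mask means "blocked".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def distillation_loss(student_hidden: torch.Tensor, teacher_hidden: torch.Tensor) -> torch.Tensor:
    """MSE between student and detached teacher hidden states, both (batch, frames, dim)."""
    return F.mse_loss(student_hidden, teacher_hidden.detach())


class TinyEncoder(nn.Module):
    """Toy stand-in for an attention encoder; not the real Rope-Wav2vec2."""

    def __init__(self, dim: int = 32, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, blocked=None):
        # blocked: bool (T, T); True = this key is NOT visible to this query.
        y, _ = self.attn(x, x, x, attn_mask=blocked, need_weights=False)
        return self.proj(x + y)


torch.manual_seed(0)
teacher = TinyEncoder().eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Student starts from the teacher weights and is trained under the mask.
student = TinyEncoder()
student.load_state_dict(teacher.state_dict())

T, lookahead = 10, 3
idx = torch.arange(T)
blocked = idx[None, :] > idx[:, None] + lookahead  # keys beyond the lookahead window

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)
x = torch.randn(2, T, 32)  # stand-in for audio features
with torch.no_grad():
    target = teacher(x)  # teacher sees the full sequence

losses = []
for _ in range(20):
    loss = distillation_loss(student(x, blocked), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

A layer-by-layer variant would sum the same loss over intermediate hidden states; I do not know which of the two the paper uses.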

### 3. RoPE Integration
The paper mentions "Rope-Wav2vec2" - does this mean you replaced the positional encoding with Rotary Position Embedding (RoPE)?
- If so, how does RoPE interact with the lookahead attention mask?
- Is RoPE only applied to the audio encoder or also to the motion generation model?
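
My current understanding, which may not match your implementation, is that standard RoPE rotates the queries and keys before the dot product, while the lookahead mask is applied to the resulting scores afterwards, so the two would be independent. The sketch below uses the common "rotate half" RoPE variant; the function `rope` and the tensor sizes are illustrative assumptions on my part.

```python
import torch


def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding to x of shape (..., T, D) with even D."""
    T, D = x.shape[-2], x.shape[-1]
    half = D // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) pair by a position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


torch.manual_seed(0)
T, D, lookahead = 8, 16, 3

# Use the same query/key vector at every position so that only position matters.
q = rope(torch.randn(D).expand(T, D))
k = rope(torch.randn(D).expand(T, D))
scores = q @ k.T  # scores[i, j] depends only on the offset j - i

# The lookahead mask is applied on top of the RoPE scores.
idx = torch.arange(T)
visible = idx[None, :] <= idx[:, None] + lookahead
masked_scores = scores.masked_fill(~visible, float("-inf"))
```

Because the scores depend only on the relative offset, a lookahead of 3 frames would always expose the same three relative positions (+1, +2, +3) to every query, regardless of where the chunk starts in the stream.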

### 4. Code Release Plan
Do you have plans to release the following components?
- [ ] Rope-Wav2vec2 model implementation with lookahead attention
- [ ] Distillation training script and configuration
- [ ] Pretrained lookahead audio encoder weights

## Current Code Analysis

In the released code:
```python
# motion_gen_gpt_flowmatching_addaudio_linear_twowavencoder.py

def make_attention_causal(attn: Wav2Vec2Attention):
    # ...
    y = F.scaled_dot_product_attention(
        q, k, v,
        dropout_p=p if self.training else 0.0,
        is_causal=False,  # Currently set to False
    )
    # ...

class WrapedWav2Vec(nn.Module):
    def __init__(self, layers: int = 1):
        base = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        # ...
        for l in self.encoder.layers:
            make_attention_causal(l.attention)
```

It appears the current implementation doesn't include the lookahead mechanism described in the paper. Would it be possible to update the repository with the streaming-ready version?
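
For illustration, this is roughly the change I would have expected inside the patched attention, assuming the lookahead is expressed as a boolean `attn_mask` passed to `F.scaled_dot_product_attention` (where `True` means "may attend"). The helper `attend_with_lookahead` is hypothetical and is not a function from the repository.

```python
import torch
import torch.nn.functional as F


def attend_with_lookahead(q, k, v, lookahead_frames: int):
    """Scaled dot-product attention where frame i sees keys j <= i + lookahead_frames.

    q, k, v: (batch, heads, frames, head_dim). Hypothetical helper for illustration.
    """
    T = q.shape[-2]
    idx = torch.arange(T, device=q.device)
    mask = idx[None, :] <= idx[:, None] + lookahead_frames  # True = may attend
    # is_causal must stay False when an explicit attn_mask is supplied.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask, is_causal=False)


torch.manual_seed(0)
q, k, v = (torch.randn(1, 2, 8, 16) for _ in range(3))
out = attend_with_lookahead(q, k, v, lookahead_frames=3)

# Perturb frames 4 and later: frame 0 (which sees frames 0..3) should not change,
# while frame 1 (which sees frames 0..4) should.
k2, v2 = k.clone(), v.clone()
k2[..., 4:, :] += 1.0
v2[..., 4:, :] += 1.0
out2 = attend_with_lookahead(q, k2, v2, lookahead_frames=3)

frame0_unchanged = torch.allclose(out[..., 0, :], out2[..., 0, :])
frame1_changed = not torch.allclose(out[..., 1, :], out2[..., 1, :])
```

A check like the perturbation above would also be a convenient way to verify that a released streaming encoder really uses no more than 60ms of future audio at the attention level.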

## Thank You

I appreciate your time in addressing these questions. This work is very valuable for the talking head generation community, and understanding the streaming audio encoder implementation would help others build upon your work.

