Nova Sonic 2.0 Proactive Speech (SYSTEM_SPEECH) and Cross-Modal Interactive Input

### Feature Type

I cannot use LiveKit without it

### Feature Description

Amazon Nova Sonic 2.0 introduces two capabilities that the AWS realtime plugin does not yet support:

1. **SYSTEM_SPEECH Role** - A new role that allows the system to inject content that the assistant will **speak aloud**. Unlike the existing `SYSTEM` role (which provides silent instructions), `SYSTEM_SPEECH` content is vocalized by the assistant.

2. **Cross-Modal Interactive Input** - The ability to inject text messages during an active voice session using `interactive: true`, enabling mixed audio and text input in the same conversation.
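To make the two capabilities concrete, here is a rough sketch of the event sequence the plugin might send for each one. It assumes the `contentStart` / `textInput` / `contentEnd` envelope used for Nova Sonic text input; the helper name `text_content_events`, the exact field names, and the `interactive` value used with `SYSTEM_SPEECH` are assumptions to check against the input events documentation, not confirmed plugin or service behavior.

```python
import json
import uuid
from typing import Any


def text_content_events(
    prompt_name: str,
    text: str,
    role: str,
    interactive: bool,
) -> list[dict[str, Any]]:
    """Build contentStart / textInput / contentEnd events for one text block.

    Hypothetical helper: field names follow the assumed Nova Sonic event shape.
    """
    content_name = str(uuid.uuid4())  # one content block shares a single name
    return [
        {"event": {"contentStart": {
            "promptName": prompt_name,
            "contentName": content_name,
            "type": "TEXT",
            "interactive": interactive,
            "role": role,
            "textInputConfiguration": {"mediaType": "text/plain"},
        }}},
        {"event": {"textInput": {
            "promptName": prompt_name,
            "contentName": content_name,
            "content": text,
        }}},
        {"event": {"contentEnd": {
            "promptName": prompt_name,
            "contentName": content_name,
        }}},
    ]


# Proactive speech: text the assistant should speak aloud.
# interactive=False is an assumption here; the issue does not specify it.
speech_events = text_content_events(
    "prompt-1", "Your meeting starts in 5 minutes", "SYSTEM_SPEECH", interactive=False
)

# Cross-modal input: user text injected mid-session with interactive=True.
text_events = text_content_events(
    "prompt-1", "What's the weather in Seattle?", "USER", interactive=True
)

# Each event would be serialized to JSON before being written to the stream.
payloads = [json.dumps(e) for e in speech_events + text_events]
```

Under these assumptions, the only differences between the two features at the wire level would be the `role` and `interactive` values on `contentStart`.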

### Use Cases

**Proactive Speech (SYSTEM_SPEECH):**
- Proactively informing users of events or notifications without waiting for user input
- Injecting context for the assistant to announce (e.g., "Your meeting starts in 5 minutes")
- Triggering assistant speech based on external events or tool results

**Cross-Modal Interactive Input:**
- Injecting text-based context mid-conversation (e.g., data from a database query)
- Sending structured input alongside voice
- Providing tool results or external data as text during voice sessions

### AWS Documentation References

- [Amazon Nova Sonic Input Events](https://docs.aws.amazon.com/nova/latest/userguide/speech-input-events.html) - Documents the `SYSTEM_SPEECH` role and `interactive` parameter
- [Using Amazon Nova Sonic Speech-to-Speech model](https://docs.aws.amazon.com/nova/latest/userguide/speech.html) - Overview of Nova Sonic capabilities
- [Amazon Nova Sonic Prompting Best Practices](https://docs.aws.amazon.com/nova/latest/userguide/prompting-speech.html) - Guidance on system prompts

### Proposed API

```python
from livekit.plugins.aws.experimental.realtime import RealtimeModel, RealtimeSession


async def demo() -> None:
    # Create a Nova Sonic 2.0 session
    model = RealtimeModel.with_nova_sonic_2()
    session: RealtimeSession = model.session()

    # Proposed method: proactive speech, the assistant speaks this aloud
    await session.inject_system_speech("You have a new message from John.")

    # Proposed method: cross-modal text input during a voice session
    await session.send_text_input("What's the weather in Seattle?")
```

### Implementation Notes

- These features are Nova Sonic 2.0 only (`amazon.nova-2-sonic-v1:0`)
- The existing `ROLE` type needs to be extended to include `SYSTEM_SPEECH`
- Version gating should prevent usage with Nova Sonic 1.0
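The notes above could be sketched roughly as follows. Everything here is illustrative: the `Role` alias and its existing members, `UnsupportedFeatureError`, and `check_role` are hypothetical names, not the plugin's actual types; only the `amazon.nova-2-sonic-v1:0` model ID and the `SYSTEM_SPEECH` role come from this issue.

```python
from typing import Literal

# Hypothetical role type; the existing members are assumed, SYSTEM_SPEECH is the addition.
Role = Literal["USER", "ASSISTANT", "TOOL", "SYSTEM", "SYSTEM_SPEECH"]

NOVA_SONIC_2_MODEL_ID = "amazon.nova-2-sonic-v1:0"

# Roles that should only be accepted when the session targets Nova Sonic 2.0.
NOVA_SONIC_2_ONLY_ROLES = frozenset({"SYSTEM_SPEECH"})


class UnsupportedFeatureError(RuntimeError):
    """Raised when a Nova Sonic 2.0 feature is used with another model."""


def check_role(model_id: str, role: Role) -> None:
    """Reject 2.0-only roles unless the session uses the Nova Sonic 2.0 model."""
    if role in NOVA_SONIC_2_ONLY_ROLES and model_id != NOVA_SONIC_2_MODEL_ID:
        raise UnsupportedFeatureError(
            f"role {role!r} requires {NOVA_SONIC_2_MODEL_ID}, got {model_id!r}"
        )


# Allowed: SYSTEM_SPEECH on Nova Sonic 2.0.
check_role(NOVA_SONIC_2_MODEL_ID, "SYSTEM_SPEECH")

# Rejected: SYSTEM_SPEECH on any other model ID.
try:
    check_role("amazon.nova-sonic-v1:0", "SYSTEM_SPEECH")
except UnsupportedFeatureError as exc:
    print(f"gated: {exc}")
```

Failing fast in the plugin keeps the error close to the caller instead of surfacing as a service-side validation failure mid-stream.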

### Environment

- Plugin: `livekit-plugins-aws`
- Module: `livekit.plugins.aws.experimental.realtime`
- Nova Sonic version: 2.0 (`amazon.nova-2-sonic-v1:0`)


### Workarounds / Alternatives

_No response_

### Additional Context

_No response_

Nova Sonic 2.0 Proactive Speech (SYSTEM_SPEECH) and Cross-Modal Interactive Input #4574