Add support for GPT-Realtime-2.0 #5690
tinalenguyen
left a comment
tested it and it works well, though i think the start/stop speaking_at times recorded in the chat messages aren't accurate 🤔 all chat messages in the same generation end up with the same started_speaking_at timestamp
    msg_tasks.clear()

    if audio_output is not None and audio_out is not None:
        await audio_output.wait_for_playout()
should we call perform_audio_forwarding once for multiple segments instead of calling it multiple times in one response, e.g. by merging the streams from generation_ev.message_stream? that way we wouldn't need to change the output and synchronizer
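As a rough illustration of the merging idea raised in this comment, the sketch below flattens several per-message audio streams into one stream so that a single forwarding pass can consume it. Everything here is hypothetical: `fake_message_stream`, `fake_message_audio`, `merge_audio`, and `forward_once` are stand-ins written for this example and are not the library's `perform_audio_forwarding` or `generation_ev.message_stream`.

```python
import asyncio
from collections.abc import AsyncIterator


async def fake_message_audio(frames: list[bytes]) -> AsyncIterator[bytes]:
    # Hypothetical stand-in for the audio stream of one message item.
    for frame in frames:
        await asyncio.sleep(0)  # hand control back to the loop, as a real stream would
        yield frame


async def fake_message_stream() -> AsyncIterator[AsyncIterator[bytes]]:
    # Hypothetical stand-in for a response that contains two message items.
    yield fake_message_audio([b"a1", b"a2"])
    yield fake_message_audio([b"b1"])


async def merge_audio(
    messages: AsyncIterator[AsyncIterator[bytes]],
) -> AsyncIterator[bytes]:
    # Drain each message's audio fully before moving on to the next message.
    async for message_audio in messages:
        async for frame in message_audio:
            yield frame


async def forward_once() -> list[bytes]:
    # A single forwarding pass over the merged stream (here it just collects frames).
    return [frame async for frame in merge_audio(fake_message_stream())]
```

One trade-off of this approach: the merged stream no longer marks where one message ends and the next begins, so any per-message attribution of playback events would have to be tracked some other way.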
btw, I didn't see multiple messages in a single response during my testing. is there a way to trigger that, or is it just random?
update: I saw multiple segments after explicitly asking for them, e.g. "tell me a story with two parts"
Summary
Update realtime agent output handling for Realtime 2.0 responses and fix transcript synchronization races around overlapping segment lifecycle events.
Realtime 2.0 can produce multiple message items for a single response. The Agents output stack exposes playback through a shared AudioOutput segment contract, so this PR forwards realtime message outputs sequentially through the sink. That keeps playback-start/playback-finished events attributable to the correct message and avoids adding or truncating the wrong assistant item during interruption.
This PR also hardens TranscriptSynchronizer segment handling when audio/text flush timing is not perfectly aligned.
Changes
Compatibility
This is intended to be public-API compatible. It does not change the AudioOutput or TextOutput interfaces or event payload shapes.
Existing realtime models emitted a single message item per response, so the sequential forwarding path preserves prior behavior while correctly handling the new Realtime 2.0 multi-message response shape.
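The sequential forwarding path described above can be pictured with a small, self-contained sketch. It is not the PR's implementation: `FakeSink`, `play_segment`, `forward_messages`, and `run_demo` are hypothetical names, and the event tuples are an assumed shape chosen only to show that each playback event maps to exactly one message when segments are forwarded one at a time.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeSink:
    """Hypothetical audio sink that records playback events per segment."""

    events: list[tuple[str, str]] = field(default_factory=list)

    async def play_segment(self, message_id: str, frames: list[bytes]) -> None:
        self.events.append(("playback_started", message_id))
        for _ in frames:
            await asyncio.sleep(0)  # pretend to push one frame to the device
        self.events.append(("playback_finished", message_id))


async def forward_messages(
    sink: FakeSink, messages: list[tuple[str, list[bytes]]]
) -> None:
    # Forward one message at a time: the next segment starts only after
    # the previous one has reported playback_finished.
    for message_id, frames in messages:
        await sink.play_segment(message_id, frames)


def run_demo(messages: list[tuple[str, list[bytes]]]) -> list[tuple[str, str]]:
    sink = FakeSink()
    asyncio.run(forward_messages(sink, messages))
    return sink.events
```

With a single message the loop runs once, which mirrors the claim that the single-message behavior of existing realtime models is unchanged.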
The transcript synchronizer changes are internal lifecycle fixes. The main observable timing change is that a non-interrupted synced playback_finished event may wait for text/audio inputs to complete when playback finishes before transcript input drains. Interrupted and already-ready playback events remain synchronous.
Testing
make check
make unit-tests
uv run pytest tests/test_agent_session.py tests/test_transcript_synchronizer.py -q
git diff --check
make unit-tests result: 635 passed, 2 skipped
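The playback_finished timing change noted under Compatibility can be modeled with plain asyncio primitives. This is a simplified sketch under stated assumptions, not the real TranscriptSynchronizer: `SegmentSketch`, its `text_done`/`audio_done` events, and `on_playback_finished` are hypothetical names invented for this example.

```python
import asyncio


class SegmentSketch:
    """Hypothetical model of one synchronized segment."""

    def __init__(self) -> None:
        self.text_done = asyncio.Event()   # set when text input has drained
        self.audio_done = asyncio.Event()  # set when audio input has drained

    async def on_playback_finished(self, interrupted: bool) -> str:
        if interrupted:
            # Interrupted playback does not wait for the inputs.
            return "finished:interrupted"
        # Non-interrupted playback waits until both inputs have completed.
        await asyncio.gather(self.text_done.wait(), self.audio_done.wait())
        return "finished:complete"


async def demo() -> list[str]:
    results: list[str] = []

    segment = SegmentSketch()
    segment.audio_done.set()  # audio drained, text still pending
    finish = asyncio.create_task(segment.on_playback_finished(interrupted=False))
    await asyncio.sleep(0)  # let the task run until it blocks on text_done
    results.append("done-early" if finish.done() else "pending")
    segment.text_done.set()
    results.append(await finish)

    interrupted = SegmentSketch()
    results.append(await interrupted.on_playback_finished(interrupted=True))
    return results
```

In this model the non-interrupted handler stays pending until the text input completes, while the interrupted handler returns without waiting on either input.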