This project demonstrates an OpenAI Voice Agent that supports speech-to-speech interaction with live text transcripts in the UI.
- Uses `gpt-4o-mini-realtime-preview` for real-time voice conversations.
- Uses `gpt-4o-mini-transcribe` for converting user speech into text.
- Both text and speech from the user and the assistant are displayed in the UI.
- This is a demo template, not production-ready.
- `gpt-4o-mini-transcribe` and `gpt-4o-mini-realtime-preview` do not sync perfectly.
- OpenAI currently does not support combining them seamlessly.
- Real-time WebRTC connection to OpenAI Realtime API.
- Audio streaming for bidirectional voice communication.
- Text transcripts of both user queries and assistant responses.
- Microphone toggle to start/stop audio capture.
- Session lifecycle management (start/stop, cleanup).
| Component | Location | Purpose |
|---|---|---|
| realtime-voice-agent | This repo | Complete frontend + WebRTC integration |
| Backend | voice-agent-core | Backend API with batch + realtime endpoints |
- React (UI + hooks)
- WebRTC (peer connection, data channels)
- OpenAI Realtime API
- TypeScript (for type safety)
- TailwindCSS (styling)
- React Router (navigation)
The agent supports OpenAI's function calling capabilities for enhanced interactivity. The current implementation includes a single page-navigation tool:
```ts
export const toolsDefinition = [
  {
    name: "navigate",
    description: "Navigate to a specific page in the app",
    parameters: {
      type: "object",
      properties: {
        page: {
          type: "string",
          enum: ["/", "/about"],
          description: "Page path to navigate to",
        },
      },
      required: ["page"],
    },
  },
];

export const TOOLS = toolsDefinition.map((tool) => ({
  type: "function" as const,
  ...tool,
}));
```

Assistant Instructions:

```ts
export const INSTRUCTIONS = `
Tool:
- navigate(page: "/", "/about") → switch app page.
Use the tool only when user requests a page change. Otherwise, answer normally.
`;
```

For comprehensive function calling documentation, refer to the OpenAI Function Calling Guide.
The voice agent follows a structured initialization and communication flow:
1. **Ephemeral Token Generation**
   - Fetch session credentials from `localhost:8080/api/realtime/session`
   - Extract `client_secret.value` for authentication
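A minimal sketch of this step, assuming the backend returns the standard Realtime session object (the helper name is illustrative):

```ts
// Sketch: fetch an ephemeral client secret from the local backend.
async function getEphemeralKey(): Promise<string> {
  const response = await fetch("http://localhost:8080/api/realtime/session");
  if (!response.ok) {
    throw new Error(`Session request failed: ${response.status}`);
  }
  const session = await response.json();
  // The Realtime session object carries the ephemeral key in client_secret.value.
  return session.client_secret.value;
}
```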
2. **WebRTC Setup**
   - Create an `RTCPeerConnection` instance
   - Configure an audio element for playback with `autoplay` enabled
   - Set up an `ontrack` handler for incoming audio streams
   - Capture the user's microphone stream via `getUserMedia()`
   - Add microphone tracks to the peer connection
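A sketch of this step using standard browser WebRTC APIs (the function name is illustrative):

```ts
// Sketch: create the peer connection, wire up audio playback, and attach the mic.
async function setupPeerConnection() {
  const pc = new RTCPeerConnection();

  // Audio element for assistant playback; autoplay so audio starts on arrival.
  const audioEl = document.createElement("audio");
  audioEl.autoplay = true;

  // Route incoming remote audio streams to the audio element.
  pc.ontrack = (event) => {
    audioEl.srcObject = event.streams[0];
  };

  // Capture the user's microphone and add its tracks to the connection.
  const micStream = await navigator.mediaDevices.getUserMedia({ audio: true });
  micStream.getTracks().forEach((track) => pc.addTrack(track, micStream));

  return { pc, micStream };
}
```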
3. **Session Initialization**
   - Create a data channel (`oai-events`) for event communication
   - Generate an SDP offer using `createOffer()`
   - Exchange SDP with the OpenAI Realtime API endpoint
   - Configure the session with tools, instructions, and the transcription model
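A sketch of the data channel creation and SDP exchange, following OpenAI's documented WebRTC flow for the Realtime API (verify the endpoint and headers against the current docs):

```ts
// Sketch: open the event channel and exchange SDP with the Realtime API.
async function connect(pc: RTCPeerConnection, ephemeralKey: string) {
  // Data channel for JSON events (session updates, transcripts, tool calls).
  const dc = pc.createDataChannel("oai-events");

  // Create and apply the local SDP offer.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);

  // Post the offer SDP; the API responds with an SDP answer.
  const model = "gpt-4o-mini-realtime-preview";
  const resp = await fetch(`https://api.openai.com/v1/realtime?model=${model}`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${ephemeralKey}`,
      "Content-Type": "application/sdp",
    },
    body: offer.sdp,
  });
  await pc.setRemoteDescription({ type: "answer", sdp: await resp.text() });

  return dc;
}
```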
4. **Active Communication**
   - Real-time bidirectional voice interaction
   - Event-driven message handling via the WebRTC data channel
   - Microphone toggle functionality (start/stop recording)
   - Live transcript generation and display
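The microphone toggle can be as simple as flipping the captured audio tracks' `enabled` flag, which mutes capture without tearing down the connection; a sketch:

```ts
// Sketch: toggle microphone capture on an existing media stream.
function setMicrophoneEnabled(micStream: MediaStream, enabled: boolean) {
  micStream.getAudioTracks().forEach((track) => {
    track.enabled = enabled;
  });
}
```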
5. **Session Cleanup**
   - Close the data channel and peer connection
   - Stop all media tracks and release microphone access
   - Reset all state variables and references
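A sketch of the teardown matching these steps:

```ts
// Sketch: close the channel, release the microphone, and drop the connection.
function stopSession(
  pc: RTCPeerConnection | null,
  dc: RTCDataChannel | null,
  micStream: MediaStream | null
) {
  dc?.close();
  micStream?.getTracks().forEach((track) => track.stop());
  pc?.close();
  // Callers should also reset any React state/refs that hold these objects.
}
```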
The speech model (gpt-4o-mini-realtime-preview) and transcription model (gpt-4o-mini-transcribe) operate as separate services without shared session identifiers. This architectural limitation creates race conditions and message ordering issues in the user-agent conversation flow.
The workaround is a completion state manager that tracks both model responses and advances the conversation cycle only after receiving final events from both services, ensuring proper message sequencing and chat history integrity.
This approach ensures:
- Message Ordering: Prevents conversation flow disruption
- State Consistency: Maintains chat history integrity
- Race Condition Mitigation: Synchronizes asynchronous model responses
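A minimal sketch of the idea; the class and callback names are illustrative, not the repo's actual implementation:

```ts
// Sketch: commit a conversation turn only after both models report completion.
class TurnCompletionTracker {
  private responseDone = false;
  private transcriptionDone = false;

  constructor(private onTurnComplete: () => void) {}

  handleEvent(type: string) {
    if (type === "response.done") this.responseDone = true;
    if (type === "conversation.item.input_audio_transcription.completed") {
      this.transcriptionDone = true;
    }
    // Advance the chat history only when both services are final.
    if (this.responseDone && this.transcriptionDone) {
      this.onTurnComplete();
      this.responseDone = false;
      this.transcriptionDone = false;
    }
  }
}
```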
The system handles multiple WebRTC data channel events:
- `response.audio_transcript.delta`: Live bot response transcript chunks
- `conversation.item.input_audio_transcription.delta`: Live user speech transcription
- `response.done`: Signals completion of the assistant response (voice + potential function calls)
- `conversation.item.input_audio_transcription.completed`: User transcription finalization
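A sketch of routing these events off the data channel; the handler callbacks are hypothetical stand-ins for the app's transcript and state updaters:

```ts
// Sketch: dispatch incoming Realtime events to UI callbacks.
function attachEventHandlers(
  dc: RTCDataChannel,
  handlers: {
    onAssistantDelta: (text: string) => void;
    onUserDelta: (text: string) => void;
    onResponseDone: (response: unknown) => void;
    onUserTranscriptDone: (text: string) => void;
  }
) {
  dc.onmessage = (event) => {
    const msg = JSON.parse(event.data);
    switch (msg.type) {
      case "response.audio_transcript.delta":
        handlers.onAssistantDelta(msg.delta); // live assistant text chunk
        break;
      case "conversation.item.input_audio_transcription.delta":
        handlers.onUserDelta(msg.delta); // live user speech text
        break;
      case "response.done":
        handlers.onResponseDone(msg.response); // may include function calls
        break;
      case "conversation.item.input_audio_transcription.completed":
        handlers.onUserTranscriptDone(msg.transcript);
        break;
    }
  };
}
```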
```ts
const sessionUpdate = {
  type: "session.update",
  session: {
    tools: TOOLS,
    instructions: INSTRUCTIONS,
    input_audio_transcription: {
      model: "gpt-4o-mini-transcribe",
    },
    max_response_output_tokens: 300,
  },
};
```

For complete session object specifications and available parameters, refer to the OpenAI Realtime Session Object documentation.
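The update is sent as a JSON string over the `oai-events` channel once it opens; a minimal sketch, assuming `dc` is the data channel created during session initialization:

```ts
// Send the session configuration as soon as the event channel is open.
dc.onopen = () => dc.send(JSON.stringify(sessionUpdate));
```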
- Detection: The system identifies a function call in the `response.done` event
- Parsing: Extract the function name and arguments from the response
- Execution: Execute the corresponding action (e.g., page navigation)
- Feedback: Send a completion confirmation back to the model
- Response Generation: Trigger a new model response if required
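A sketch of these five steps, following the Realtime API's function-call event shapes (field names should be verified against the current docs; `navigate` stands in for the app's React Router navigation):

```ts
// Sketch: detect, parse, execute, and acknowledge a function call.
function handleFunctionCalls(
  msg: { response?: { output?: any[] } },
  dc: RTCDataChannel,
  navigate: (page: string) => void
) {
  for (const item of msg.response?.output ?? []) {
    if (item.type !== "function_call") continue; // 1. Detection

    const args = JSON.parse(item.arguments);     // 2. Parsing
    if (item.name === "navigate") {
      navigate(args.page);                       // 3. Execution
    }

    // 4. Feedback: report the tool result back to the model.
    dc.send(
      JSON.stringify({
        type: "conversation.item.create",
        item: {
          type: "function_call_output",
          call_id: item.call_id,
          output: JSON.stringify({ success: true }),
        },
      })
    );

    // 5. Response generation: ask the model to continue speaking.
    dc.send(JSON.stringify({ type: "response.create" }));
  }
}
```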
- Start session → Creates a WebRTC peer connection with OpenAI.
- Start microphone → Captures user voice.
- Send audio → Transmitted to the Realtime API.
- Receive response →
  - Assistant replies in voice.
  - Transcript is displayed as text.
  - Function calls executed when requested (e.g., navigation).
- Stop session → Cleans up peer connection and tracks.
This is an experimental demo meant for learning and exploration. It is not optimized for production environments. Expect synchronization issues between voice and text responses due to OpenAI's current API limitations.