The Project Interview Coach is a real-time, voice-based AI interviewer built on LiveKit's infrastructure. The system consists of the following main components:
```
┌─────────────────┐     ┌──────────────────┐     ┌────────────────────┐
│  React Frontend │◄───►│   Token Server   │     │   LiveKit Cloud    │
│   (TypeScript)  │     │    (FastAPI)     │     │                    │
└─────────────────┘     └──────────────────┘     └────────────────────┘
         │                                                  ▲
         │                                                  │
         └──────────────────────────────────────────────────┘
                                                            │
                                                            ▼
                                              ┌──────────────────────────┐
                                              │    AI Agent (Python)     │
                                              │  - VAD + STT + LLM       │
                                              │  - RAG Integration       │
                                              │  - Vector Store (Chroma) │
                                              └──────────────────────────┘
```
- Frontend (React + TypeScript + Vite)
  - LiveKit React Components for WebRTC
  - Real-time audio/video handling
  - Live transcription display
  - Modern, responsive UI
- Token Server (FastAPI)
  - Issues LiveKit access tokens
  - Manages session capacity (max 5 concurrent)
  - CORS-enabled for local development
- AI Agent (Python + LiveKit Agents SDK)
  - Voice Activity Detection (Silero VAD)
  - Speech-to-Text (AssemblyAI)
  - LLM (OpenAI GPT-4o-mini)
  - Text-to-Speech (Cartesia Sonic)
  - RAG system for project documentation
  - Function calling for doc search and feedback
- RAG System (LangChain + ChromaDB; see the ingestion sketch after this list)
  - PDF document ingestion
  - Semantic chunking (1000 chars, 200 overlap)
  - OpenAI embeddings
  - Deduplication for better retrieval
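A minimal ingestion sketch under these assumptions (LangChain package names vary by version; the PDF path and persist directory are illustrative, not the project's exact values):

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Load a project PDF and split it into 1000-char chunks with 200-char overlap
docs = PyPDFLoader("docs/project_overview.pdf").load()  # hypothetical path
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed with OpenAI and persist locally in ChromaDB
store = Chroma.from_documents(
    chunks,
    embedding=OpenAIEmbeddings(),
    persist_directory="chroma_db",  # hypothetical directory
)
```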
1. User opens frontend → requests token from token server
2. Token server checks capacity → issues LiveKit token
3. Frontend connects to LiveKit room → agent joins automatically
4. Agent greets user → conversation begins
```
User Speech
    ↓
VAD detects end-of-speech (~500ms silence)
    ↓
STT transcribes (AssemblyAI) → Text sent to LLM
    ↓
LLM decides: Need project context?
    ├─ Yes → Calls search_project_docs() tool
    │        → RAG retrieves relevant chunks
    │        → Chunks added to LLM context
    └─ No  → Proceeds with conversation
    ↓
LLM generates response (GPT-4o-mini)
    ↓
TTS synthesizes (Cartesia) → Audio streamed via LiveKit
    ↓
User hears response → Cycle repeats
```
The RAG system is not always active. It's triggered on-demand via function calling:
1. Agent receives user answer → LLM analyzes whether project-specific knowledge is needed
2. Function call triggered → `search_project_docs(query)` invoked
3. Semantic search → ChromaDB retrieves top-k chunks (k=4)
4. Deduplication → content fingerprinting removes overlaps
5. Context injection → chunks added to the LLM system message
6. Follow-up generated → LLM uses retrieved context to ask an informed question
This on-demand approach:
- ✅ Reduces latency (no RAG on every turn)
- ✅ Saves API costs (fewer embedding searches)
- ✅ Keeps context focused (only when needed)
See RAG Chunk Size Decision for why 1000/200 was chosen.
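For illustration, the search tool might look roughly like this, assuming the `function_tool` decorator from a recent LiveKit Agents SDK and a `store` handle to the persisted Chroma collection (both assumptions, not the project's exact code; `deduplicate_chunks` is sketched under the Deduplication decision below):

```python
from livekit.agents import function_tool  # assumed import path; verify for your SDK version

@function_tool
async def search_project_docs(query: str) -> str:
    """Search the project documentation for context relevant to `query`."""
    # Top-k semantic search against the persisted Chroma store (k=4)
    results = store.similarity_search(query, k=4)  # `store` assumed from ingestion
    # Fingerprint-based dedup before injecting into the LLM context
    unique = deduplicate_chunks([doc.page_content for doc in results])
    return "\n\n".join(unique) if unique else "No relevant documentation found."
```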
The agent is built on LiveKit Agents SDK using the AgentSession pattern. Core components:
```python
session = AgentSession(
    stt="assemblyai/universal-streaming:en",  # Real-time transcription
    llm="openai/gpt-4o-mini",                 # Conversation engine
    tts="cartesia/sonic-3:...",               # Voice synthesis
    vad=silero.VAD.load(),                    # Speech detection
    turn_detection=MultilingualModel(),       # End-of-turn detection
)
```

Lifecycle:
- Connection: Agent joins the room when LiveKit triggers the `@server.rtc_session()` decorator
- Initialization: Session starts with room connection + audio pipeline setup
- Greeting: Initial reply generated to establish conversation
- Active: Agent listens for speech → transcribes → LLM → TTS → responds
- Cleanup: Monitor task detects end conditions → graceful shutdown → decrement counter
Key Design:
- Single-threaded per session: Each agent session is isolated (no shared state)
- Event-driven: Reacts to LiveKit room events (`track_subscribed`, speech detection)
- Async-first: All I/O operations are non-blocking (STT, LLM, TTS)
Concurrency Control:
```python
MAX_CONCURRENT_SESSIONS = 5
active_sessions = 0  # Global counter, protected by asyncio.Lock
```

- Hard limit enforced at session start
- Thread-safe using `asyncio.Lock` (no race conditions)
- New connections rejected if `active_sessions >= MAX_CONCURRENT_SESSIONS`
- Count decremented in a `finally` block (guaranteed cleanup; see the sketch below)
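A sketch of how that guard could be implemented (helper names are illustrative, not the project's exact code):

```python
import asyncio

MAX_CONCURRENT_SESSIONS = 5
active_sessions = 0
_sessions_lock = asyncio.Lock()

async def try_acquire_session() -> bool:
    """Atomically claim a session slot; returns False at capacity."""
    global active_sessions
    async with _sessions_lock:
        if active_sessions >= MAX_CONCURRENT_SESSIONS:
            return False
        active_sessions += 1
        return True

async def decrement_session_count() -> None:
    """Release a slot; called from the session's `finally` block."""
    global active_sessions
    async with _sessions_lock:
        active_sessions = max(0, active_sessions - 1)
```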
Idle Timeout Strategy:
```python
IDLE_TIMEOUT = 900            # seconds (15 min): max session duration
ACTIVITY_CHECK_INTERVAL = 60  # seconds: monitor frequency
EMPTY_ROOM_GRACE_PERIOD = 3   # seconds: connection race buffer
```
Activity Tracking:
- `last_activity_time` updated on every audio track subscription
- Background monitor task checks every 60s for:
  - Empty room (no non-agent participants)
  - Idle timeout (no activity for 15 min)
- Grace period prevents premature cleanup during user reconnections
Cleanup Triggers:
- User leaves room → empty room detected after 3s grace
- 15 min idle → timeout triggered (see the monitor sketch below)
- Connection error → exception surfaces; cleanup runs in the `finally` block
- Server shutdown → `asyncio.CancelledError` propagates
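A sketch of the monitor loop implied by these triggers, assuming `room` exposes LiveKit's `remote_participants` mapping and `last_activity_time` is the one-element list described under State Management:

```python
import asyncio
import time

async def monitor_session(room, last_activity_time: list[float], shutdown: asyncio.Event) -> None:
    """Background task: ends the session on empty room or idle timeout."""
    while not shutdown.is_set():
        await asyncio.sleep(ACTIVITY_CHECK_INTERVAL)
        # `remote_participants` excludes the agent itself (assumed LiveKit rtc.Room API)
        if not room.remote_participants:
            await asyncio.sleep(EMPTY_ROOM_GRACE_PERIOD)  # absorb reconnect races
            if not room.remote_participants:
                break  # room genuinely empty
        if time.time() - last_activity_time[0] > IDLE_TIMEOUT:
            break  # idle too long
    shutdown.set()  # signal graceful cleanup to the session task
```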
Tool Registration:
```python
class Assistant(Agent):
    def __init__(self):
        super().__init__(
            instructions=INTERVIEWER_SYSTEM_PROMPT,
            tools=[search_project_docs, generate_feedback],  # Auto-registered
        )
```

How It Works:
- Tools defined as Python functions with type hints + docstrings
- LiveKit SDK auto-converts to OpenAI function schema
- LLM receives tool schemas in system message
- When LLM returns function call → SDK intercepts → executes Python function
- Function result injected back into LLM context
- LLM generates natural language response using tool output
Tool Execution Flow Example 1: Documentation Search
User: "What does this project do?"
↓
LLM decides: Need project context → Calls search_project_docs("project purpose")
↓
RAG retrieves relevant chunks from ChromaDB
↓
Chunks returned to LLM as tool result
↓
LLM generates answer: "This project is a voice-based interviewer..."
↓
TTS speaks response
Tool Execution Flow Example 2: Feedback Generation
Interview concludes → User asks for feedback
↓
LLM analyzes conversation history → Identifies strengths and areas for improvement
↓
LLM calls generate_feedback(
strengths="Clear explanation of system architecture...",
areas_for_improvement="Could elaborate more on error handling...",
rating=8
)
↓
Tool formats feedback as structured summary
↓
Formatted feedback returned to LLM
↓
LLM delivers feedback naturally: "Great job! You scored 8/10. Your strengths were..."
↓
TTS speaks feedback
Available Tools:
- `search_project_docs(query: str) -> str`
  - Purpose: Retrieve relevant information from project documentation via RAG
  - When triggered: LLM needs project-specific context to ask informed questions
  - Parameters:
    - `query`: Search query string (e.g., "authentication implementation", "API endpoints")
  - Returns: Retrieved document chunks or an error message
  - Behavior: Searches ChromaDB with k=4, deduplicates results, formats as context
  - Latency impact: +1-2 seconds per call
- `generate_feedback(strengths: str, areas_for_improvement: str, rating: int) -> str`
  - Purpose: Generate a structured end-of-interview feedback summary
  - When triggered: Interview concludes or user explicitly requests feedback
  - Parameters:
    - `strengths`: Positive aspects of the interview (what the candidate did well)
    - `areas_for_improvement`: Constructive criticism (where to improve)
    - `rating`: Numeric score 1-10 (automatically clamped if out of range)
  - Returns: Formatted feedback summary with rating, strengths, and areas for improvement
  - Behavior: Validates rating bounds, formats as a readable summary, logs the rating
  - Example output:

    ```
    Interview Feedback Summary
    ==========================
    RATING: 8/10

    STRENGTHS:
    Clear explanation of system architecture...

    AREAS FOR IMPROVEMENT:
    Could elaborate more on error handling...
    ```
Key Behaviors:
- Tools are async (non-blocking) but RAG search uses blocking I/O
- No retry logic (fails open - LLM responds without tool data if error)
- Tool calls logged at INFO level for debugging
- Multiple tools can be called in sequence (e.g., search docs → ask question → generate feedback)
- Tool schemas auto-generated from Python type hints + docstrings (see the sketch below)
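For illustration, `generate_feedback` might be declared like this (decorator import path assumed, not confirmed by the source; the formatting mirrors the example output above):

```python
import logging

from livekit.agents import function_tool  # assumed import path; verify for your SDK version

logger = logging.getLogger(__name__)

@function_tool
async def generate_feedback(strengths: str, areas_for_improvement: str, rating: int) -> str:
    """Summarize end-of-interview feedback with a 1-10 rating."""
    rating = max(1, min(10, rating))  # clamp out-of-range scores
    logger.info("Interview rated %d/10", rating)
    return (
        "Interview Feedback Summary\n"
        "==========================\n"
        f"RATING: {rating}/10\n\n"
        f"STRENGTHS:\n{strengths}\n\n"
        f"AREAS FOR IMPROVEMENT:\n{areas_for_improvement}"
    )
```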
Stateless Design:
- No persistent state between sessions
- Each session is self-contained
- Conversation history managed by `AgentSession` (in-memory only)
Session-Level State:
```python
last_activity_time = [time.time()]  # Mutable container for closure
```

- Used for idle timeout tracking
- Updated via event handler closure
- Scoped to a single session (no shared state; see the closure sketch below)
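Why a one-element list rather than a plain float: an event-handler closure can mutate the list's contents without a `nonlocal` declaration. A minimal sketch:

```python
import time

last_activity_time = [time.time()]  # mutable container shared with handlers

def on_track_subscribed(*args):
    # Closures can rebind a list element without `nonlocal`
    last_activity_time[0] = time.time()
```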
Global State:
```python
active_sessions = 0  # Only global state - protected by lock
```

Trade-offs:
- ✅ Simple, no state synchronization issues
- ✅ Each session independent (crash doesn't affect others)
- ❌ No conversation persistence (sessions can't resume)
- ❌ No cross-session analytics (would need database)
Three-Layer Strategy:
1. Initialization Errors (Fail Fast):
```python
try:
    session = AgentSession(...)
except Exception as e:
    logger.error(f"Failed to initialize: {e}")
    raise  # Exit immediately, don't occupy a session slot
```

2. Runtime Errors (Log and Continue):

```python
try:
    await session.generate_reply(...)
except Exception as e:
    logger.error(f"Failed to generate greeting: {e}")
    # Continue - agent still functional for conversation
```

3. Cleanup Errors (Always Decrement):

```python
finally:
    await decrement_session_count()  # Guaranteed to run
```

Error Recovery Behaviors:
- STT failure: Session continues, next utterance retried
- LLM timeout: User sees silence, can repeat question
- TTS failure: Text shown but not spoken
- Room disconnect: Monitor task exits → cleanup triggered
- Tool failure: LLM proceeds without tool data
Observability:
- All errors logged with `exc_info=True` (full stack traces)
- Session lifecycle logged at INFO level
- Tool calls visible in logs
- Active session count tracked
Production Improvements Needed:
- Dead letter queue for failed tool calls
- Metrics (Prometheus): session duration, error rates
- Health check endpoint for load balancer
- Circuit breaker for external APIs (STT/TTS)
- Graceful shutdown (drain in-progress sessions)
Decision: Use LiveKit Agents SDK with pre-configured STT/TTS/LLM providers
Why:
- Abstraction: Handles WebRTC complexities (NAT traversal, echo cancellation, jitter buffering)
- Real-time: Optimized for low-latency voice interactions
- Scalability: Built-in support for concurrent sessions
Trade-offs:
- ✅ Pros: Quick setup, production-ready voice quality, automatic transcription
- ❌ Cons: Vendor lock-in to LiveKit ecosystem, limited customization of audio processing
- Alternative considered: Build custom WebRTC + separate STT/TTS APIs (too complex, more latency)
Decision: Split documents into 1000-character chunks with 200 characters of overlap
Why:
- 1000 chars ≈ 250 tokens: Fits well in LLM context without overwhelming it
- 200 overlap: Preserves context across chunk boundaries (critical for technical docs)
- Balance: Large enough to capture complete concepts, small enough for precise retrieval
Trade-offs:
- ✅ Pros: Good balance between context and precision, reduces semantic drift
- ❌ Cons: More chunks = slower ingestion, overlap creates duplicates (solved with deduplication)
- Alternatives:
- 500 chars: Too fragmented for technical explanations
- 2000 chars: Too broad, loses precision in retrieval
Decision: Use the first 150 chars of each chunk as a fingerprint to filter duplicates (sketched after the trade-offs below)
Why:
- Speed: O(n) time complexity with set lookups
- Effectiveness: Chunk overlap often creates near-duplicates with identical beginnings
- Simplicity: No need for fuzzy matching or embeddings comparison
Trade-offs:
- ✅ Pros: Fast, deterministic, catches most overlap duplicates
- ❌ Cons: Misses duplicates with different opening sentences
- Alternative considered: Cosine similarity on embeddings (too slow, unnecessary for this use case)
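A sketch of the fingerprint filter described above:

```python
def deduplicate_chunks(chunks: list[str], fingerprint_len: int = 150) -> list[str]:
    """Drop chunks whose first 150 chars match an already-seen chunk (O(n) with set lookups)."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        fp = chunk[:fingerprint_len].strip()
        if fp not in seen:
            seen.add(fp)
            unique.append(chunk)
    return unique
```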
Decision: Hard limit of 5 concurrent interview sessions
Why:
- Cost control: STT/TTS APIs charge per minute (AssemblyAI, Cartesia)
- Quality: Prevents resource exhaustion that could degrade all sessions
- Quota limits: OpenAI rate limits could impact performance at scale
Trade-offs:
- ✅ Pros: Predictable costs, reliable performance, prevents quota exhaustion
- ❌ Cons: Users may get "capacity reached" errors during peak times
- Production improvement: Add queue system or dynamic scaling
Decision:
- Empty room grace period: 3 seconds
- Idle timeout: 15 minutes
- Activity check interval: 60 seconds
Why:
- Resource cleanup: Prevents abandoned sessions from consuming API quotas
- Grace period: Avoids race conditions when user is connecting
- Idle timeout: Safety net for hung connections or forgotten sessions
Trade-offs:
- ✅ Pros: Automatic cleanup, prevents cost leaks, resilient to network issues
- ❌ Cons: Long-running conversations get cut off (though 15 min is generous)
- Alternative: User-initiated end only (risky - abandoned sessions waste money)
Decision: Use GPT-4o-mini for agent conversations
Why:
- Speed: Lower latency for better conversational flow
- Cost: ~10x cheaper than GPT-4
- Capability: Sufficient for structured interviews with RAG context
Trade-offs:
- ✅ Pros: Fast responses (better UX), cost-effective for many sessions
- ❌ Cons: Less nuanced follow-ups compared to GPT-4, may miss subtle technical cues
- When to use GPT-4: Complex code review interviews, senior-level technical depth
Decision: Show interim transcriptions immediately, replace with final when available
Why:
- UX: User sees text appear instantly as they speak (feels responsive)
- Accuracy: Final transcriptions correct ASR errors
- Timeout fallback: Auto-finalize interim after 3 seconds (handles missing finals)
Trade-offs:
- ✅ Pros: Immediate feedback, accurate final text, handles edge cases
- ❌ Cons: Text may flicker during interim→final transition, complex state management
- Alternative: Final only (simpler but feels laggy)
Decision: Use Vite for frontend bundling
Why:
- Speed: Instant HMR during development
- Simplicity: No server-side rendering needed for this use case
- LiveKit compatibility: Official React components work out of the box
Trade-offs:
- ✅ Pros: Fast dev experience, simple deployment (static files), no SSR overhead
- ❌ Cons: No built-in API routes (separate token server needed)
- Alternative: Next.js would add unnecessary complexity here
Decision: Use ChromaDB with local persistence
Why:
- Simplicity: Embedded database, no external service
- Cost: Free for local development
- Sufficient: Handles small-to-medium doc collections well
Trade-offs:
- ✅ Pros: Zero setup, no API keys, works offline, fast for small datasets
- ❌ Cons: Not suitable for production scale, limited query features
- Production migration: Switch to Pinecone/Weaviate for thousands of documents
Decision: If capacity check fails, allow connection anyway
Why:
- Availability: Technical issues shouldn't block users unnecessarily
- Graceful degradation: Better UX than hard failures
Trade-offs:
- ✅ Pros: More resilient, better user experience during API issues
- ❌ Cons: Could exceed intended limits if LiveKit API is down
- Alternative: Fail closed (safer but frustrating for users); a sketch of the fail-open check follows
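A sketch of the fail-open check (`get_active_session_count()` is a hypothetical helper standing in for the LiveKit room query):

```python
import logging

logger = logging.getLogger(__name__)

async def check_capacity() -> bool:
    """True if a new session may start; errs on the side of admitting users."""
    try:
        return await get_active_session_count() < MAX_CONCURRENT_SESSIONS  # hypothetical helper
    except Exception:
        logger.warning("Capacity check failed; failing open", exc_info=True)
        return True  # admit the user rather than hard-fail on an internal error
```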
```
User speaks → VAD detects end:  ~500ms
STT transcription:              ~1-2s
LLM generates response:         ~1-3s (with RAG: ~2-4s)
TTS synthesis:                  ~500ms
────────────────────────────────────
Total turn latency:             ~3-7s
```
Latency Factors:
- RAG retrieval: Adds 1-2s when the agent calls search_project_docs
- Cold start: First LLM response is slower (~5s) due to model loading
- Network: WebRTC latency depends on user location vs LiveKit edge
Optimizations:
- Deduplication reduces LLM context size by ~30%
- Concurrent audio processing (VAD runs while STT transcribes)
- Cached embeddings model in RAG retrieval
Current Gaps:
- Token server allows all origins (CORS `*`)
- No authentication on the token endpoint
- LiveKit tokens expire after default period
- No rate limiting on token issuance
Hardening Recommendations:
- Authentication: Add OAuth/JWT to token server
- Rate limiting: Prevent token farming
- CORS: Restrict to known frontend domains
- Token TTL: Short-lived tokens (1 hour); see the minting sketch after this list
- Input validation: Sanitize room names, usernames
- Secrets management: Use vault/KMS for API keys
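A sketch of short-lived token minting with the `livekit-api` Python package (verify `with_ttl` against the SDK version in use; key loading is illustrative):

```python
from datetime import timedelta

from livekit import api

def mint_token(identity: str, room: str) -> str:
    """Issue a LiveKit access token that expires after one hour."""
    return (
        api.AccessToken(API_KEY, API_SECRET)  # load from a secrets manager, not source
        .with_identity(identity)
        .with_grants(api.VideoGrants(room_join=True, room=room))
        .with_ttl(timedelta(hours=1))  # short-lived token
        .to_jwt()
    )
```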
- Voice quality verification
- Transcription accuracy
- UI responsiveness
- Cross-browser compatibility (Chrome, Safari, Firefox)
- LiveKit transcriptions are complex: Needed interim/final handling + deduplication + timeouts
- RAG needs tuning: Generic chunk sizes don't work well for technical docs
- Concurrency is critical: Without limits, costs can spiral quickly
- Error handling matters: Silent failures in voice apps are terrible UX
- Logging strategy: Too much logging slows down, too little makes debugging impossible
Local Development:
- Run token server + agent + frontend locally
- Use `.env.local` for secrets
- Single developer workflow
Simple Production:
- Deploy token server to Railway/Fly.io
- Agent runs on dedicated VM (better for long-running processes)
- Frontend on Vercel/Netlify
- Use managed ChromaDB or Pinecone
Production at Scale:
- Kubernetes for agent horizontal scaling
- Redis for session state management
- PostgreSQL for interview storage
- CloudFlare for frontend CDN
- Grafana for monitoring
This design prioritizes developer experience and cost efficiency for a prototype/MVP. The architecture is suitable for:
- Up to 100 interviews/day
- Single-tenant deployment
- Development/testing environments
For production at scale, consider:
- Managed vector database (Pinecone)
- Separate STT/TTS services (lower cost providers)
- Load balancing and auto-scaling
- Monitoring and alerting
- Database for persistence
The core RAG + voice agent pattern is solid and can scale with infrastructure improvements.