Summary
Upgrade the existing Parakeet ASR integration from opt-in streaming prototype to production-grade provider covering both streaming and batch paths, with proper GPU deployment and native diarization. Target: replace Deepgram for primary STT at 100-200x lower cost.
Cost justification
Current Deepgram spend (May 2026, from mon):
- Gross: $40,175/mo | Net (with 1:1 match credits): $20,088/mo
- Usage: 68,081 hrs/mo (50,887 streaming + 17,194 batch)
- Rate breakdown: base $28K + diarize $6.1K + keyterm $5.4K + multilingual $0.6K
- Net ASR spend = 19.8% of total backend costs ($101.6K/mo)
- Match credits will expire, at which point the gross figure becomes the actual bill
Self-hosted Parakeet estimate:
- 68,081 hrs/mo at $0.002/hr = **$136/mo** compute
- L4 GPU instance: $350-500/mo (existing GKE has a T4 at $1,204/mo that could potentially be reused)
- Projected savings: $19K-39K/mo depending on credit expiry timing
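The savings range can be reproduced from the figures above. This is a minimal sketch that assumes the $136/mo compute estimate and the L4 instance cost are additive (the issue does not say whether they overlap); the constant and function names are illustrative only.

```python
# Figures copied from the cost section above.
HOURS_PER_MONTH = 68_081
COMPUTE_RATE_PER_HOUR = 0.002   # USD/hr, self-hosted Parakeet estimate
DEEPGRAM_GROSS = 40_175         # USD/mo, the bill once match credits expire
DEEPGRAM_NET = 20_088           # USD/mo, with 1:1 match credits
GPU_LOW, GPU_HIGH = 350, 500    # USD/mo, quoted L4 instance range


def monthly_savings(deepgram_bill: float, gpu_cost: float) -> float:
    """Deepgram bill minus self-hosted compute and GPU instance cost.

    Assumption: compute and instance cost are added together.
    """
    compute = HOURS_PER_MONTH * COMPUTE_RATE_PER_HOUR
    return deepgram_bill - compute - gpu_cost


compute_cost = HOURS_PER_MONTH * COMPUTE_RATE_PER_HOUR
worst_case = monthly_savings(DEEPGRAM_NET, GPU_HIGH)   # credits still active
best_case = monthly_savings(DEEPGRAM_GROSS, GPU_LOW)   # credits expired
print(f"compute ${compute_cost:.0f}/mo, savings ${worst_case:,.0f} to ${best_case:,.0f}/mo")
```

Under that assumption both ends stay inside the quoted $19K-39K range (rounded down), so the range holds whether or not the two self-hosting costs overlap.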
Current state (already in main)
What exists
- Streaming socket (utils/stt/streaming.py:780-1013): ParakeetStreamingSocket implements the STTSocket ABC. Buffers PCM16 in 6s windows, POSTs to /v1/transcribe, returns segment dicts. Online speaker diarization via embedding-based greedy clustering (cosine threshold 0.45).
- Parakeet microservice (backend/parakeet/): FastAPI service loading NeMo parakeet-tdt-0.6b-v3 (multilingual, 25 European languages). Single endpoint POST /v1/transcribe accepts WAV file upload, returns {text, segments[{text, start, end}]}. Auth via ENCRYPTION_SECRET bearer token. Dockerfile with CUDA 13.2 runtime.
- Routing (routers/transcribe.py:358-359): Opt-in only. The client must request stt_service=parakeet AND HOSTED_PARAKEET_API_URL must be set. Never auto-switches from Deepgram.
- STTService enum (streaming.py:39): parakeet is already a member alongside deepgram and modulate.
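For readers unfamiliar with the online diarization approach mentioned above, here is a minimal sketch of embedding-based greedy clustering. It is not the code in streaming.py: the class name is hypothetical, the 0.45 threshold is assumed to be a minimum cosine similarity (it could equally be a distance in the real implementation), and centroids are assumed to be running means.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class GreedySpeakerClusterer:
    """Assigns each segment embedding to the closest known speaker, or opens a new one."""

    def __init__(self, threshold=0.45):
        self.threshold = threshold  # assumed: minimum similarity to join a cluster
        self.centroids = []         # running mean embedding per speaker
        self.counts = []            # number of segments folded into each centroid

    def assign(self, embedding):
        best_id, best_sim = -1, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(embedding, centroid)
            if sim > best_sim:
                best_id, best_sim = i, sim
        if best_id >= 0 and best_sim >= self.threshold:
            # Fold the new embedding into the matched speaker's running mean.
            n = self.counts[best_id]
            self.centroids[best_id] = [
                (c * n + e) / (n + 1) for c, e in zip(self.centroids[best_id], embedding)
            ]
            self.counts[best_id] = n + 1
            return best_id
        # No speaker is similar enough: start a new cluster.
        self.centroids.append(list(embedding))
        self.counts.append(1)
        return len(self.centroids) - 1


clusterer = GreedySpeakerClusterer()
labels = [clusterer.assign(e) for e in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9])]
```

Because each decision is made once and never revisited, the approach works online but can split or merge speakers early in a session when few embeddings have been seen.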
What's missing for production
- No PrerecordedSTTProvider implementation: Parakeet has no batch path. The pre_recorded.py factory only routes to Deepgram or Modulate. 9 callsites use batch transcription:
  - utils/chat.py (6 callsites): voice messages, desktop PTT, desktop transcribe
  - utils/conversations/postprocess_conversation.py (1): re-transcription
  - routers/sync.py (1): sync-local-files
  - utils/speaker_sample.py (1): speaker sample verification
- No NIM container deployment: the current Dockerfile uses raw NeMo (nemo_toolkit[asr]), not the production-optimized NVIDIA NIM container (nvcr.io/nim/nvidia/parakeet-1-1b-rnnt-multilingual), which includes TensorRT + Triton for higher throughput.
- No native diarization: Parakeet returns no speaker labels. The current workaround embeds each segment via the external HOSTED_SPEAKER_EMBEDDING_API_URL and does greedy clustering. This adds latency and depends on a separate service. Nemotron 3.5 ASR (0.6B, June 2026) may include native diarization; needs investigation.
- No language routing in factory: get_stt_service_for_language() has no Parakeet entry. It can't be selected via the STT_SERVICE_MODELS env var like Deepgram/Modulate, only via explicit client opt-in.
- Model outdated: currently loads parakeet-tdt-0.6b-v3 (25 European languages). Nemotron 3.5 ASR 0.6B supports 40 languages including Vietnamese and Mandarin with 80ms streaming latency and native punctuation, a better fit for Omi's multilingual user base.
Proposed work
Phase 1: Batch path + factory routing
- Implement ParakeetPrerecordedProvider (implements PrerecordedSTTProvider ABC)
- Add parakeet routing to get_prerecorded_service() and get_stt_service_for_language()
- Support STT_SERVICE_MODELS=parakeet and STT_PRERECORDED_MODEL=parakeet env vars
- All 9 batch callsites work without code changes (provider-agnostic ABC)
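A rough sketch of what the Phase 1 provider could look like. The real PrerecordedSTTProvider ABC is not shown in this issue, so the `transcribe` method, the constructor arguments, and the injected `post` transport are all hypothetical; only the endpoint path, bearer auth, and response shape come from the description above.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List

# Hypothetical transport: (url, headers, wav_bytes) -> decoded JSON response.
PostFn = Callable[[str, Dict[str, str], bytes], dict]


class PrerecordedSTTProvider(ABC):
    """Stand-in for the real ABC; its actual method names may differ."""

    @abstractmethod
    def transcribe(self, wav_bytes: bytes) -> List[dict]:
        ...


class ParakeetPrerecordedProvider(PrerecordedSTTProvider):
    def __init__(self, base_url: str, token: str, post: PostFn):
        self._url = base_url.rstrip("/") + "/v1/transcribe"
        self._headers = {"Authorization": f"Bearer {token}"}
        self._post = post

    def transcribe(self, wav_bytes: bytes) -> List[dict]:
        body = self._post(self._url, self._headers, wav_bytes)
        # Response shape from the issue: {text, segments[{text, start, end}]}.
        return [
            {"text": s["text"].strip(), "start": float(s["start"]), "end": float(s["end"])}
            for s in body.get("segments", [])
            if s.get("text", "").strip()  # drop empty or whitespace-only segments
        ]


def fake_post(url, headers, wav_bytes):
    # Stand-in for the HTTP upload so the sketch runs without a GPU service.
    assert url.endswith("/v1/transcribe")
    assert headers["Authorization"].startswith("Bearer ")
    return {
        "text": "hi there",
        "segments": [
            {"text": " hi there ", "start": 0, "end": 1.2},
            {"text": " ", "start": 1.2, "end": 1.5},
        ],
    }


provider = ParakeetPrerecordedProvider("http://parakeet.internal/", "secret", fake_post)
segments = provider.transcribe(b"RIFF...")
```

Injecting the transport keeps the provider testable without a running service; a real implementation would pass a function that performs the multipart WAV upload.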
Phase 2: NIM container deployment
- Switch from raw NeMo Dockerfile to NVIDIA NIM container (nvcr.io/nim/nvidia/parakeet-1-1b-rnnt-multilingual)
- TensorRT optimization for ~238x real-time throughput on L4
- Evaluate: deploy on existing GKE GPU node pool vs new dedicated instance
- Health checks, autoscaling, resource limits
Phase 3: Model upgrade + diarization
- Evaluate Nemotron 3.5 ASR 0.6B (40 languages, native punctuation, 80ms streaming)
- Investigate native speaker diarization support — if absent, integrate NeMo MSDD or pyannote as sidecar
- Test CJK character handling (Deepgram splits with spaces — Parakeet behavior unknown)
- Test code-switching quality (language mixing within utterance)
Phase 4: Gradual rollout
- A/B test via STT_SERVICE_MODELS=parakeet,dg-nova-3 (Parakeet primary, DG fallback)
- Monitor: WER, first-segment latency (<10s target, <30s max), diarization accuracy
- Track DG usage decline → eventual deprecation of DG as primary
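The primary/fallback behaviour implied by a value like STT_SERVICE_MODELS=parakeet,dg-nova-3 could be sketched as follows. This is an assumption about how an ordered list might be resolved, not the existing get_stt_service_for_language() logic; the function names and the language subset are illustrative.

```python
from typing import Iterable, List, Optional

# Illustrative subset only; the real supported-language list belongs to the model.
PARAKEET_LANGUAGES = {"en", "de", "fr", "es", "it"}


def parse_models(raw: str) -> List[str]:
    """Split a comma-separated env value into an ordered priority list."""
    return [m.strip() for m in raw.split(",") if m.strip()]


def pick_service(language: str, models: Iterable[str], parakeet_available: bool = True) -> Optional[str]:
    """Return the first model in priority order that can serve the language."""
    base = language.split("-")[0].lower()  # "en-US" -> "en"
    for model in models:
        if model == "parakeet":
            if parakeet_available and base in PARAKEET_LANGUAGES:
                return model
            continue  # unsupported language or service not configured: try next
        return model  # assumption: non-Parakeet entries accept any language
    return None


models = parse_models("parakeet, dg-nova-3")
choices = {
    "en-US": pick_service("en-US", models),
    "vi": pick_service("vi", models),
    "en (parakeet down)": pick_service("en", models, parakeet_available=False),
}
```

Keeping the fallback in the factory rather than at each callsite means the rollout can be reverted by changing the env var alone.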
Architecture reference
STTSocket (ABC)                      PrerecordedSTTProvider (ABC)
├─ SafeDeepgramSocket                ├─ DeepgramPrerecordedProvider
├─ SafeModulateSocket                ├─ ModulatePrerecordedProvider
└─ ParakeetStreamingSocket ✅        └─ ParakeetPrerecordedProvider ❌ (missing)

Streaming factory:                   Batch factory:
get_stt_service_for_language()       get_prerecorded_service()
  → STT_SERVICE_MODELS env var         → STT_PRERECORDED_MODEL env var
  → routes: dg-nova-3, modulate        → routes: dg-nova-3, modulate
  → parakeet: opt-in only ⚠️           → parakeet: not supported ❌
Environment variables (current)
| Var | Default | Purpose |
|---|---|---|
| HOSTED_PARAKEET_API_URL | (unset) | Parakeet service URL |
| PARAKEET_WINDOW_SECONDS | 6.0 | Streaming buffer window |
| PARAKEET_DIARIZATION | 1 | Enable embedding-based diarization |
| PARAKEET_MODEL | nvidia/parakeet-tdt-0.6b-v3 | NeMo model name |
| HOSTED_SPEAKER_EMBEDDING_API_URL | (unset) | Required for Parakeet diarization |
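As an illustration of how these variables and their defaults fit together, here is a small loader sketch. The dataclass, the function name, the "1 means enabled" parsing, and the 16 kHz mono sample rate are assumptions; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ParakeetConfig:
    api_url: Optional[str]
    window_seconds: float
    diarization: bool
    model: str
    embedding_api_url: Optional[str]

    @property
    def enabled(self) -> bool:
        # Parakeet can only be used when the service URL is set.
        return self.api_url is not None

    @property
    def diarization_active(self) -> bool:
        # Diarization also needs the external embedding service.
        return self.diarization and self.embedding_api_url is not None

    def window_bytes(self, sample_rate: int = 16_000) -> int:
        # PCM16 is 2 bytes per sample; mono audio at 16 kHz is an assumption.
        return int(self.window_seconds * sample_rate * 2)


def load_parakeet_config(env: Mapping[str, str] = os.environ) -> ParakeetConfig:
    return ParakeetConfig(
        api_url=env.get("HOSTED_PARAKEET_API_URL") or None,
        window_seconds=float(env.get("PARAKEET_WINDOW_SECONDS", "6.0")),
        diarization=env.get("PARAKEET_DIARIZATION", "1") == "1",  # assumed parsing
        model=env.get("PARAKEET_MODEL", "nvidia/parakeet-tdt-0.6b-v3"),
        embedding_api_url=env.get("HOSTED_SPEAKER_EMBEDDING_API_URL") or None,
    )


defaults = load_parakeet_config({})
```

With an empty environment the config reports Parakeet as disabled and diarization as inactive, even though PARAKEET_DIARIZATION defaults to 1, because both URLs are unset.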
Open questions
- Does Nemotron 3.5 ASR support speaker diarization natively? If not, what's the best sidecar (NeMo MSDD vs pyannote)?
- NIM container vs raw NeMo — latency/throughput benchmarks on L4?
- Where to host GPU workload — existing GKE T4 node pool ($1,204/mo already running) or new L4 instance?
- Code-switching quality (e.g. Vietnamese-English mixing) — Parakeet is reportedly weak here vs Modulate
- CJK tokenization — how does Parakeet handle word boundaries for Chinese/Japanese/Korean?
Best model candidates (from geni research)
| Model | Languages | WER | Speed | Params | Notes |
|---|---|---|---|---|---|
| Nemotron 3.5 ASR 0.6B | 40 (incl. vi, zh) | n/a | 80ms streaming | 0.6B | Recommended: native punctuation, CC-BY-4.0 |
| Parakeet-TDT-0.6b-v2 | English only | 6.05% | 3386x RT | 0.6B | Best English accuracy |
| Parakeet-TDT-0.6b-v3 | 25 European | 6.32% | n/a | 0.6B | Currently loaded in main |
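Related
- backend/parakeet/: existing microservice
- backend/diarizer/: existing pyannote speaker diarization service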
Status Update (2026-06-09)
Solved ✅ (PR #7653)
| Feature | Status |
|---|---|
| Batch ASR (/v1/transcribe) | ✅ Prod: TDT 0.6b, 0.1% WER, full punctuation |
| Batch + diarization (/v2/transcribe) | ✅ Prod: built-in pyannote/wespeaker on GPU |
| Streaming ASR (/v3/stream) | ✅ Prod (opt-in): RNNT 1.1b, 5s segments, AGC |
| Pre-recorded routing | ✅ Prod: STT_PRERECORDED_MODEL=parakeet,dg-nova-3 with language fallback |
| Backend integration | ✅ ParakeetPrerecordedProvider + ParakeetWebSocketSocket |
| Language fallback | ✅ Parakeet for 25 EU langs, auto-fallback to Deepgram for CJK/Hindi/Korean |
| BF16 dual-model on L4 GPU | ✅ TDT + RNNT in 12.6Gi/20Gi |
| GPU concurrency safety | ✅ Semaphore serializes CUDA access |
| Helm chart + GitHub Actions workflow | ✅ gcp_parakeet.yml |
| E2E device test (Omi BLE → app UI) | ✅ Verified on Pixel 7a |
Remaining
| Feature | Priority | Details |
|---|---|---|
| Streaming as default | HIGH | Needs punctuation fix (RNNT outputs lowercase). Plan: interim RNNT → final TDT re-transcribe |
| Request batching | HIGH | Current: 1-at-a-time (semaphore). Target: batch_size=4-8 for ~4x throughput |
| NeMo thread-safety fix | HIGH | NeMo #15771: model.transcribe() freeze/unfreeze race confirmed as bug, not intentional. New OSS ASR worker patching upstream |
| Split batch/stream semaphores | MEDIUM | Batch and streaming currently block each other |
| NIM or Triton migration | MEDIUM | Raw NeMo + FastAPI is prototype-grade per NVIDIA. NIM handles batching, streaming, concurrency natively |
| Nemotron Speech 0.6B | LOW | Half memory vs RNNT 1.1b, designed for streaming, better concurrent support |
| TensorRT export | LOW | ~2x faster inference, ~50% less memory |
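The "split batch/stream semaphores" item could look roughly like the sketch below: one semaphore per path, so a long batch job no longer blocks streaming requests. The class and its parameters are hypothetical, and the sketch assumes the TDT and RNNT models can safely run on the GPU at the same time, which still needs to be verified given the NeMo thread-safety issue listed above.

```python
import asyncio
import time


class GpuGate:
    """Separate concurrency limits for batch and streaming inference."""

    def __init__(self, batch_slots: int = 1, stream_slots: int = 1):
        self._sems = {
            "batch": asyncio.Semaphore(batch_slots),
            "stream": asyncio.Semaphore(stream_slots),
        }

    async def run(self, kind: str, infer, *args):
        # Requests of the same kind are serialized; the two kinds are independent.
        async with self._sems[kind]:
            # Run the blocking model call in a worker thread, off the event loop.
            return await asyncio.to_thread(infer, *args)


async def demo():
    gate = GpuGate()  # created inside the running loop
    finished = []

    def fake_infer(name, delay):
        time.sleep(delay)  # stands in for a blocking model.transcribe() call
        finished.append(name)
        return name

    results = await asyncio.gather(
        gate.run("batch", fake_infer, "batch-1", 0.2),
        gate.run("stream", fake_infer, "stream-1", 0.01),
        gate.run("batch", fake_infer, "batch-2", 0.01),
    )
    return results, finished


results, finished = asyncio.run(demo())
```

In the demo the streaming call completes while the first batch call is still running, and the second batch call waits its turn behind the first.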
Prod Metrics (T+24h, stable)
- 7,005 requests served, 0% error rate (post-semaphore fix)
- v1=2.88s, v2=8.65s latency
- 0 pod restarts, 13.0Gi memory (no leak)