feat: production-grade NVIDIA Parakeet/Nemotron ASR — batch path, NIM deployment, diarization #7651

@beastoin

Summary

Upgrade the existing Parakeet ASR integration from an opt-in streaming prototype to a production-grade provider covering both the streaming and batch paths, with proper GPU deployment and native diarization. Target: replace Deepgram for primary STT at 100-200x lower cost.

Cost justification

Current Deepgram spend (May 2026, from mon):

  • Gross: $40,175/mo | Net (with 1:1 match credits): $20,088/mo
  • Usage: 68,081 hrs/mo (50,887 streaming + 17,194 batch)
  • Rate breakdown: base $28K + diarize $6.1K + keyterm $5.4K + multilingual $0.6K
  • Net ASR spend (after match credits) = 19.8% of total backend costs ($101.6K/mo)
  • Match credits will expire — gross becomes actual bill

Self-hosted Parakeet estimate:

  • 68,081 hrs/mo at $0.002/hr = **$136/mo** compute
  • L4 GPU instance: $350-500/mo (existing GKE has a T4 at $1,204/mo that could potentially be reused)
  • Projected savings: $19K-39K/mo depending on credit expiry timing
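
The savings range above can be cross-checked with a few lines of arithmetic. This is a rough sketch that uses only figures quoted in this issue; the variable names are illustrative, and it assumes the per-hour compute estimate and the L4 instance price are additive and that the upper L4 price ($500/mo) applies.

```python
# Figures quoted in this issue (May 2026).
HOURS_PER_MONTH = 68_081   # 50,887 streaming + 17,194 batch
DEEPGRAM_GROSS = 40_175    # USD/mo before match credits
DEEPGRAM_NET = 20_088      # USD/mo after 1:1 match credits
PARAKEET_RATE = 0.002      # USD per audio hour (compute estimate)
L4_INSTANCE_HIGH = 500     # USD/mo, assumption: upper end of the $350-500 range


def monthly_cost(hours: float, rate: float, instance: float) -> float:
    """Self-hosted cost, assuming per-hour compute and the GPU instance add up."""
    return hours * rate + instance


parakeet_total = monthly_cost(HOURS_PER_MONTH, PARAKEET_RATE, L4_INSTANCE_HIGH)
savings_with_credits = DEEPGRAM_NET - parakeet_total
savings_without_credits = DEEPGRAM_GROSS - parakeet_total

print(f"Parakeet total: ${parakeet_total:,.0f}/mo")
print(f"Savings while credits last: ${savings_with_credits:,.0f}/mo")
print(f"Savings after credits expire: ${savings_without_credits:,.0f}/mo")
```

Under these assumptions the two savings figures land at the low and high ends of the $19K-39K/mo range quoted above.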

Current state (already in main)

What exists

  1. Streaming socket (utils/stt/streaming.py:780-1013): ParakeetStreamingSocket implements the STTSocket ABC. Buffers PCM16 in 6s windows, POSTs to /v1/transcribe, returns segment dicts. Online speaker diarization via embedding-based greedy clustering (cosine threshold 0.45).

  2. Parakeet microservice (backend/parakeet/): FastAPI service loading NeMo parakeet-tdt-0.6b-v3 (multilingual, 25 European languages). Single endpoint POST /v1/transcribe accepts WAV file upload, returns {text, segments[{text, start, end}]}. Auth via ENCRYPTION_SECRET bearer token. Dockerfile with CUDA 13.2 runtime.

  3. Routing (routers/transcribe.py:358-359): Opt-in only — client must request stt_service=parakeet AND HOSTED_PARAKEET_API_URL must be set. Never auto-switches from Deepgram.

  4. STTService enum (streaming.py:39): parakeet is already a member alongside deepgram and modulate.
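
For readers unfamiliar with the online diarization mentioned in item 1, the following is a minimal sketch of embedding-based greedy clustering. It is not the code in streaming.py: the class name, the running-mean centroid update, and the reading of 0.45 as a minimum cosine similarity (rather than a distance) are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class GreedySpeakerClusterer:
    """Assign each embedding to the most similar known speaker, else open a new one."""

    def __init__(self, threshold: float = 0.45):
        self.threshold = threshold               # assumed to be a minimum similarity
        self.centroids: List[List[float]] = []   # running mean embedding per speaker
        self.counts: List[int] = []

    def assign(self, embedding: Sequence[float]) -> int:
        best_id, best_sim = -1, -1.0
        for speaker_id, centroid in enumerate(self.centroids):
            sim = cosine_similarity(embedding, centroid)
            if sim > best_sim:
                best_id, best_sim = speaker_id, sim

        if best_id >= 0 and best_sim >= self.threshold:
            # Fold the new embedding into the matched speaker's running mean.
            n = self.counts[best_id]
            old = self.centroids[best_id]
            self.centroids[best_id] = [(c * n + e) / (n + 1) for c, e in zip(old, embedding)]
            self.counts[best_id] = n + 1
            return best_id

        # No speaker is similar enough: start a new cluster.
        self.centroids.append(list(embedding))
        self.counts.append(1)
        return len(self.centroids) - 1


clusterer = GreedySpeakerClusterer()
labels = [clusterer.assign(e) for e in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9])]
print(labels)
```

Because assignment is greedy and never revisited, an early mislabel persists for the rest of the session, which is one reason native or sidecar diarization is listed under Phase 3.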

What's missing for production

  1. No PrerecordedSTTProvider implementation — Parakeet has no batch path. The pre_recorded.py factory only routes to Deepgram or Modulate. 9 callsites use batch transcription:

    • utils/chat.py (6 callsites): voice messages, desktop PTT, desktop transcribe
    • utils/conversations/postprocess_conversation.py (1): re-transcription
    • routers/sync.py (1): sync-local-files
    • utils/speaker_sample.py (1): speaker sample verification
  2. No NIM container deployment — current Dockerfile uses raw NeMo (nemo_toolkit[asr]), not the production-optimized NVIDIA NIM container (nvcr.io/nim/nvidia/parakeet-1-1b-rnnt-multilingual) which includes TensorRT + Triton for higher throughput.

  3. No native diarization — Parakeet returns no speaker labels. Current workaround embeds each segment via external HOSTED_SPEAKER_EMBEDDING_API_URL and does greedy clustering. This adds latency and depends on a separate service. Nemotron 3.5 ASR (0.6B, June 2026) may include native diarization — needs investigation.

  4. No language routing in factory: get_stt_service_for_language() has no Parakeet entry. Parakeet can't be selected via the STT_SERVICE_MODELS env var like Deepgram/Modulate, only via explicit client opt-in.

  5. Model outdated — currently loads parakeet-tdt-0.6b-v3 (25 European languages). Nemotron 3.5 ASR 0.6B supports 40 languages including Vietnamese and Mandarin with 80ms streaming latency and native punctuation — better fit for Omi's multilingual user base.

Proposed work

Phase 1: Batch path + factory routing

  • Implement ParakeetPrerecordedProvider (implements PrerecordedSTTProvider ABC)
  • Add parakeet routing to get_prerecorded_service() and get_stt_service_for_language()
  • Support STT_SERVICE_MODELS=parakeet and STT_PRERECORDED_MODEL=parakeet env vars
  • All 9 batch callsites work without code changes (provider-agnostic ABC)
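
A possible shape for the Phase 1 provider is sketched below. The real PrerecordedSTTProvider ABC is not shown in this issue, so the stand-in ABC, its `transcribe` signature, the injected `transport` callable, and the example URL are all hypothetical; only the /v1/transcribe path and the {text, segments[{text, start, end}]} response shape come from the description above.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Optional


class PrerecordedSTTProvider(ABC):
    """Stand-in for the backend ABC; the real method names may differ."""

    @abstractmethod
    def transcribe(self, wav_bytes: bytes, language: Optional[str] = None) -> List[Dict]:
        ...


# Hypothetical transport: (url, wav_bytes, bearer_token) -> decoded JSON body.
Transport = Callable[[str, bytes, str], Dict]


class ParakeetPrerecordedProvider(PrerecordedSTTProvider):
    def __init__(self, base_url: str, token: str, transport: Transport):
        self._url = base_url.rstrip("/") + "/v1/transcribe"
        self._token = token
        self._transport = transport

    def transcribe(self, wav_bytes: bytes, language: Optional[str] = None) -> List[Dict]:
        # `language` is accepted for interface parity but unused in this sketch.
        body = self._transport(self._url, wav_bytes, self._token)
        segments = []
        for seg in body.get("segments", []):
            text = seg.get("text", "").strip()
            if not text:
                continue  # drop empty segments
            segments.append(
                {"text": text, "start": float(seg["start"]), "end": float(seg["end"]), "speaker": None}
            )
        return segments


def fake_transport(url: str, wav_bytes: bytes, token: str) -> Dict:
    """Canned response in the documented shape, so the sketch runs offline."""
    return {
        "text": "hello world",
        "segments": [
            {"text": "hello world", "start": 0.0, "end": 1.2},
            {"text": " ", "start": 1.2, "end": 1.3},
        ],
    }


provider = ParakeetPrerecordedProvider("http://parakeet.internal:8080/", "secret", fake_transport)
result = provider.transcribe(b"RIFF")
print(result)
```

Injecting the transport keeps the provider testable without a running GPU service; a production version would presumably swap in a real HTTP client with timeouts and retries.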

Phase 2: NIM container deployment

  • Switch from raw NeMo Dockerfile to NVIDIA NIM container (nvcr.io/nim/nvidia/parakeet-1-1b-rnnt-multilingual)
  • TensorRT optimization for ~238x real-time throughput on L4
  • Evaluate: deploy on existing GKE GPU node pool vs new dedicated instance
  • Health checks, autoscaling, resource limits
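
As a rough sanity check on sizing, the ~238x real-time figure can be turned into an average GPU utilization. This sketch assumes the 238x target holds for the whole workload and uses a 730-hour month; it is an average only and says nothing about peak load or concurrent streaming sessions, which the deployment evaluation still has to measure.

```python
AUDIO_HOURS_PER_MONTH = 68_081   # usage figure from the cost section
REALTIME_FACTOR = 238            # Phase 2 target on L4, not a measurement
HOURS_IN_MONTH = 730             # assumption: average month length

gpu_hours_needed = AUDIO_HOURS_PER_MONTH / REALTIME_FACTOR
average_utilization = gpu_hours_needed / HOURS_IN_MONTH

print(f"GPU hours needed: {gpu_hours_needed:.0f}/mo")
print(f"Average utilization of one L4: {average_utilization:.0%}")
```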

Phase 3: Model upgrade + diarization

  • Evaluate Nemotron 3.5 ASR 0.6B (40 languages, native punctuation, 80ms streaming)
  • Investigate native speaker diarization support — if absent, integrate NeMo MSDD or pyannote as sidecar
  • Test CJK character handling (Deepgram splits with spaces — Parakeet behavior unknown)
  • Test code-switching quality (language mixing within utterance)

Phase 4: Gradual rollout

  • A/B test via STT_SERVICE_MODELS=parakeet,dg-nova-3 (Parakeet primary, DG fallback)
  • Monitor: WER, first-segment latency (<10s target, <30s max), diarization accuracy
  • Track DG usage decline → eventual deprecation of DG as primary
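
For the WER monitoring item, a minimal word-level implementation is sketched below. It uses a plain edit distance over whitespace-separated tokens with no text normalization, which is an assumption; for CJK text without spaces, a character-level variant would be needed instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0

    # Rolling-row Levenshtein distance over word tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.3f}")
```

In an A/B setup, the same reference transcripts would be scored against both the Parakeet and Deepgram outputs so the two providers are compared on identical audio.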

Architecture reference

STTSocket (ABC)                      PrerecordedSTTProvider (ABC)
├─ SafeDeepgramSocket                ├─ DeepgramPrerecordedProvider
├─ SafeModulateSocket                ├─ ModulatePrerecordedProvider
└─ ParakeetStreamingSocket ✅        └─ ParakeetPrerecordedProvider ❌ (missing)

Streaming factory:                   Batch factory:
  get_stt_service_for_language()       get_prerecorded_service()
  → STT_SERVICE_MODELS env var         → STT_PRERECORDED_MODEL env var
  → routes: dg-nova-3, modulate        → routes: dg-nova-3, modulate
  → parakeet: opt-in only ⚠️           → parakeet: not supported ❌
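
One way the factories could treat a comma-separated model list as an ordered preference with fallback is sketched below. The actual parsing in get_stt_service_for_language() is not shown in this issue, so the function names, the language table, and the rule that Parakeet is skipped when no service URL is configured are assumptions for illustration.

```python
from typing import Dict, List, Optional, Set

# Hypothetical coverage table; None means "accept any language".
SUPPORTED_LANGUAGES: Dict[str, Optional[Set[str]]] = {
    "parakeet": {"en", "de", "fr", "es", "it"},  # illustrative subset, not the real list
    "dg-nova-3": None,
    "modulate": None,
}


def parse_models(env_value: str) -> List[str]:
    """Split a value such as 'parakeet,dg-nova-3' into an ordered preference list."""
    return [m.strip() for m in env_value.split(",") if m.strip()]


def select_service(env_value: str, language: str, parakeet_url: Optional[str]) -> Optional[str]:
    """Return the first configured model that can serve the language."""
    for model in parse_models(env_value):
        if model not in SUPPORTED_LANGUAGES:
            continue  # unknown model name: skip rather than fail
        if model == "parakeet" and not parakeet_url:
            continue  # no HOSTED_PARAKEET_API_URL: fall through to the next model
        languages = SUPPORTED_LANGUAGES[model]
        if languages is None or language in languages:
            return model
    return None


print(select_service("parakeet,dg-nova-3", "en", "http://parakeet.internal"))
print(select_service("parakeet,dg-nova-3", "zh", "http://parakeet.internal"))
```

With this ordering, listing Parakeet first makes it primary wherever it covers the language, and Deepgram picks up everything else.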

Environment variables (current)

| Var | Default | Purpose |
|---|---|---|
| HOSTED_PARAKEET_API_URL | (unset) | Parakeet service URL |
| PARAKEET_WINDOW_SECONDS | 6.0 | Streaming buffer window |
| PARAKEET_DIARIZATION | 1 | Enable embedding-based diarization |
| PARAKEET_MODEL | nvidia/parakeet-tdt-0.6b-v3 | NeMo model name |
| HOSTED_SPEAKER_EMBEDDING_API_URL | (unset) | Required for Parakeet diarization |
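
To illustrate how PARAKEET_WINDOW_SECONDS could drive the streaming buffer, here is a minimal sketch of fixed-window PCM16 buffering. The class name and the 16 kHz mono sample rate are assumptions; the actual ParakeetStreamingSocket may handle overlap, partial windows, and WAV wrapping differently.

```python
import os
from typing import List


class PCM16WindowBuffer:
    """Accumulate raw PCM16 audio and emit fixed-length windows."""

    def __init__(self, sample_rate: int = 16_000, window_seconds: float = 6.0):
        # 2 bytes per sample for mono 16-bit PCM.
        self.window_bytes = int(sample_rate * window_seconds) * 2
        self._buf = bytearray()

    def push(self, chunk: bytes) -> List[bytes]:
        """Add a chunk and return every complete window now available."""
        self._buf.extend(chunk)
        windows = []
        while len(self._buf) >= self.window_bytes:
            windows.append(bytes(self._buf[: self.window_bytes]))
            del self._buf[: self.window_bytes]
        return windows

    def flush(self) -> bytes:
        """Return the remaining partial window, e.g. when the socket closes."""
        rest = bytes(self._buf)
        self._buf.clear()
        return rest


window_seconds = float(os.environ.get("PARAKEET_WINDOW_SECONDS", "6.0"))
buffer = PCM16WindowBuffer(window_seconds=window_seconds)
ready = buffer.push(b"\x00" * 3200)  # 100 ms of silence at 16 kHz
print(len(ready), "window(s) ready")
```

Each returned window would then be wrapped as WAV and POSTed to /v1/transcribe; a larger window adds latency, while a smaller one gives the model less context per request.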

Open questions

  1. Does Nemotron 3.5 ASR support speaker diarization natively? If not, what's the best sidecar (NeMo MSDD vs pyannote)?
  2. NIM container vs raw NeMo — latency/throughput benchmarks on L4?
  3. Where to host GPU workload — existing GKE T4 node pool ($1,204/mo already running) or new L4 instance?
  4. Code-switching quality (e.g. Vietnamese-English mixing) — Parakeet is reportedly weak here vs Modulate
  5. CJK tokenization — how does Parakeet handle word boundaries for Chinese/Japanese/Korean?

Best model candidates (from geni research)

| Model | Languages | WER | Speed | Params | Notes |
|---|---|---|---|---|---|
| Nemotron 3.5 ASR 0.6B | 40 (incl. vi, zh) | | 80ms streaming | 0.6B | Recommended: native punctuation, CC-BY-4.0 |
| Parakeet-TDT-0.6b-v2 | English only | 6.05% | 3386x RT | 0.6B | Best English accuracy |
| Parakeet-TDT-0.6b-v3 | 25 European | 6.32% | | 0.6B | Currently loaded in main |

Status Update (2026-06-09)

Solved ✅ (PR #7653)

| Feature | Status |
|---|---|
| Batch ASR (/v1/transcribe) | ✅ Prod: TDT 0.6b, 0.1% WER, full punctuation |
| Batch + diarization (/v2/transcribe) | ✅ Prod: built-in pyannote/wespeaker on GPU |
| Streaming ASR (/v3/stream) | ✅ Prod (opt-in): RNNT 1.1b, 5s segments, AGC |
| Pre-recorded routing | ✅ Prod: STT_PRERECORDED_MODEL=parakeet,dg-nova-3 with language fallback |
| Backend integration | ParakeetPrerecordedProvider + ParakeetWebSocketSocket |
| Language fallback | ✅ Parakeet for 25 EU langs, auto-fallback to Deepgram for CJK/Hindi/Korean |
| BF16 dual-model on L4 GPU | ✅ TDT + RNNT in 12.6Gi/20Gi |
| GPU concurrency safety | ✅ Semaphore serializes CUDA access |
| Helm chart + GitHub Actions | Workflow gcp_parakeet.yml |
| E2E device test (Omi BLE → app UI) | ✅ Verified on Pixel 7a |

Remaining

| Feature | Priority | Details |
|---|---|---|
| Streaming as default | HIGH | Needs punctuation fix (RNNT outputs lowercase). Plan: interim RNNT → final TDT re-transcribe |
| Request batching | HIGH | Current: 1-at-a-time (semaphore). Target: batch_size=4-8 for ~4x throughput |
| NeMo thread-safety fix | HIGH | NeMo #15771: model.transcribe() freeze/unfreeze race confirmed as bug, not intentional. New OSS ASR worker patching upstream |
| Split batch/stream semaphores | MEDIUM | Batch and streaming currently block each other |
| NIM or Triton migration | MEDIUM | Raw NeMo + FastAPI is prototype-grade per NVIDIA. NIM handles batching, streaming, concurrency natively |
| Nemotron Speech 0.6B | LOW | Half memory vs RNNT 1.1b, designed for streaming, better concurrent support |
| TensorRT export | LOW | ~2x faster inference, ~50% less memory |
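
The "Split batch/stream semaphores" item could look roughly like the sketch below, where each path gets its own asyncio semaphore so a long batch job no longer blocks streaming. The `GpuGate` class and `fake_infer` are hypothetical, and the sketch assumes the TDT and RNNT models can safely run on the GPU at the same time, which this issue does not confirm (see the NeMo thread-safety item).

```python
import asyncio
import time
from typing import Any, Callable, Dict


class GpuGate:
    """Separate concurrency limits for batch and streaming inference."""

    def __init__(self, batch_slots: int = 1, stream_slots: int = 1):
        self._sems = {
            "batch": asyncio.Semaphore(batch_slots),
            "stream": asyncio.Semaphore(stream_slots),
        }
        self.in_flight: Dict[str, int] = {"batch": 0, "stream": 0}
        self.peak: Dict[str, int] = {"batch": 0, "stream": 0}

    async def run(self, kind: str, fn: Callable[..., Any], *args: Any) -> Any:
        async with self._sems[kind]:
            self.in_flight[kind] += 1
            self.peak[kind] = max(self.peak[kind], self.in_flight[kind])
            try:
                # Run the blocking model call off the event loop.
                return await asyncio.to_thread(fn, *args)
            finally:
                self.in_flight[kind] -= 1


def fake_infer(seconds: float) -> float:
    """Placeholder for a blocking model.transcribe() call."""
    time.sleep(seconds)
    return seconds


async def demo() -> Dict[str, int]:
    gate = GpuGate()  # created inside the running loop
    await asyncio.gather(
        gate.run("batch", fake_infer, 0.05),
        gate.run("batch", fake_infer, 0.05),
        gate.run("stream", fake_infer, 0.05),
    )
    return gate.peak


peak = asyncio.run(demo())
print(peak)
```

The two batch calls still run one at a time, while the streaming call proceeds alongside them; raising `batch_slots` would only make sense once the upstream thread-safety fix lands.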

Prod Metrics (T+24h, stable)

  • 7,005 requests served, 0% error rate (post-semaphore fix)
  • Latency: v1 (batch) = 2.88s, v2 (batch + diarization) = 8.65s
  • 0 pod restarts, 13.0Gi memory (no leak)

Metadata

Labels: enhancement (New feature or request), p3 (Priority: Backlog, score <14)
