feat: production-grade NVIDIA Parakeet/Nemotron ASR — batch path, NIM deployment, diarization

## Summary

Upgrade the existing Parakeet ASR integration from an opt-in streaming prototype to a production-grade provider covering both streaming and batch paths, with proper GPU deployment and native diarization. Target: replace Deepgram as the primary STT provider at an estimated 100-200x lower cost.

## Cost justification

Current Deepgram spend (May 2026, from mon):
- **Gross: $40,175/mo** | Net (with 1:1 match credits): $20,088/mo
- Usage: 68,081 hrs/mo (50,887 streaming + 17,194 batch)
- Rate breakdown: base $28K + diarize $6.1K + keyterm $5.4K + multilingual $0.6K
- Net ASR spend = 19.8% of total backend costs ($101.6K/mo)
- Match credits will expire — gross becomes actual bill

Self-hosted Parakeet estimate:
- 68,081 hrs/mo at ~$0.002/hr = **~$136/mo** compute
- L4 GPU instance: $350-500/mo (existing GKE has a T4 at $1,204/mo that could potentially be reused)
- **Projected savings: $19K-39K/mo** depending on credit expiry timing
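
The savings range above can be reproduced with a few lines of arithmetic. This is a rough sketch, assuming the ~$136/mo compute estimate and the L4 instance price are additive (the conservative reading); the constant and function names are illustrative only.

```python
# Figures copied from the cost justification above (May 2026).
GROSS_DEEPGRAM = 40_175      # $/mo, before match credits
NET_DEEPGRAM = 20_088        # $/mo, after 1:1 match credits
HOURS_PER_MONTH = 68_081     # 50,887 streaming + 17,194 batch
COMPUTE_RATE = 0.002         # $/hr, estimated self-hosted compute
GPU_INSTANCE = (350, 500)    # $/mo, L4 instance price range


def monthly_savings(deepgram_bill: float, gpu_cost: float) -> float:
    """Deepgram bill minus self-hosted cost (assumption: compute + GPU instance are additive)."""
    self_hosted = HOURS_PER_MONTH * COMPUTE_RATE + gpu_cost
    return deepgram_bill - self_hosted


compute = HOURS_PER_MONTH * COMPUTE_RATE
# Lower bound: credits still active, pricier GPU instance.
low = monthly_savings(NET_DEEPGRAM, GPU_INSTANCE[1])
# Upper bound: credits expired, cheaper GPU instance.
high = monthly_savings(GROSS_DEEPGRAM, GPU_INSTANCE[0])
print(f"compute ~${compute:.0f}/mo, savings ${low:,.0f} to ${high:,.0f}/mo")
```

Under these assumptions the range comes out at roughly $19.5K to $39.7K per month, consistent with the $19K-39K figure quoted above.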

## Current state (already in main)

### What exists

1. **Streaming socket** (`utils/stt/streaming.py:780-1013`): `ParakeetStreamingSocket` implements the `STTSocket` ABC. Buffers PCM16 in 6s windows, POSTs to `/v1/transcribe`, returns segment dicts. Online speaker diarization via embedding-based greedy clustering (cosine threshold 0.45).

2. **Parakeet microservice** (`backend/parakeet/`): FastAPI service loading NeMo `parakeet-tdt-0.6b-v3` (multilingual, 25 European languages). Single endpoint `POST /v1/transcribe` accepts WAV file upload, returns `{text, segments[{text, start, end}]}`. Auth via `ENCRYPTION_SECRET` bearer token. Dockerfile with CUDA 13.2 runtime.

3. **Routing** (`routers/transcribe.py:358-359`): Opt-in only — client must request `stt_service=parakeet` AND `HOSTED_PARAKEET_API_URL` must be set. Never auto-switches from Deepgram.

4. **STTService enum** (`streaming.py:39`): `parakeet` is already a member alongside `deepgram` and `modulate`.
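
For readers unfamiliar with the online diarization in item 1, here is a minimal sketch of greedy clustering over speaker embeddings. It is not the code in `streaming.py`. It assumes 0.45 is a minimum cosine *similarity* (the description does not say whether it is a similarity or a distance), and the class name and running-mean centroid update are illustrative choices.

```python
import math
from typing import List

COSINE_THRESHOLD = 0.45  # value quoted above; assumed here to be a similarity threshold


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class GreedySpeakerClusterer:
    """Assigns each segment embedding to the closest known speaker, or opens a new one."""

    def __init__(self, threshold: float = COSINE_THRESHOLD):
        self.threshold = threshold
        self.centroids: List[List[float]] = []
        self.counts: List[int] = []

    def assign(self, embedding: List[float]) -> int:
        # Find the most similar existing speaker centroid.
        best_id, best_sim = -1, -1.0
        for speaker_id, centroid in enumerate(self.centroids):
            sim = cosine(embedding, centroid)
            if sim > best_sim:
                best_id, best_sim = speaker_id, sim
        if best_id >= 0 and best_sim >= self.threshold:
            # Fold the new embedding into the running mean of the matched speaker.
            n = self.counts[best_id]
            self.centroids[best_id] = [
                (c * n + e) / (n + 1)
                for c, e in zip(self.centroids[best_id], embedding)
            ]
            self.counts[best_id] = n + 1
            return best_id
        # No speaker is similar enough: start a new cluster.
        self.centroids.append(list(embedding))
        self.counts.append(1)
        return len(self.centroids) - 1


clusterer = GreedySpeakerClusterer()
labels = [clusterer.assign(e) for e in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])]
print(labels)
```

Because assignment is greedy and never revisited, early mistakes persist for the rest of the session, which is one reason native or sidecar diarization is attractive.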

### What's missing for production

1. **No `PrerecordedSTTProvider` implementation** — Parakeet has no batch path. The `pre_recorded.py` factory only routes to Deepgram or Modulate. 9 callsites use batch transcription:
   - `utils/chat.py` (6 callsites): voice messages, desktop PTT, desktop transcribe
   - `utils/conversations/postprocess_conversation.py` (1): re-transcription
   - `routers/sync.py` (1): sync-local-files
   - `utils/speaker_sample.py` (1): speaker sample verification

2. **No NIM container deployment** — current Dockerfile uses raw NeMo (`nemo_toolkit[asr]`), not the production-optimized NVIDIA NIM container (`nvcr.io/nim/nvidia/parakeet-1-1b-rnnt-multilingual`) which includes TensorRT + Triton for higher throughput.

3. **No native diarization** — Parakeet returns no speaker labels. Current workaround embeds each segment via external `HOSTED_SPEAKER_EMBEDDING_API_URL` and does greedy clustering. This adds latency and depends on a separate service. Nemotron 3.5 ASR (0.6B, June 2026) may include native diarization — needs investigation.

4. **No language routing in factory** — `get_stt_service_for_language()` has no Parakeet entry. It can't be selected via `STT_SERVICE_MODELS` env var like Deepgram/Modulate — only via explicit client opt-in.

5. **Model outdated** — currently loads `parakeet-tdt-0.6b-v3` (25 European languages). Nemotron 3.5 ASR 0.6B supports 40 languages including Vietnamese and Mandarin with 80ms streaming latency and native punctuation — better fit for Omi's multilingual user base.

## Proposed work

### Phase 1: Batch path + factory routing
- Implement `ParakeetPrerecordedProvider` (implements `PrerecordedSTTProvider` ABC)
- Add `parakeet` routing to `get_prerecorded_service()` and `get_stt_service_for_language()`
- Support `STT_SERVICE_MODELS=parakeet` and `STT_PRERECORDED_MODEL=parakeet` env vars
- All 9 batch callsites work without code changes (provider-agnostic ABC)
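
A rough sketch of what the Phase 1 provider could look like. The real `PrerecordedSTTProvider` ABC is not shown in this issue, so the `transcribe` method name, its signature, the injected `transport` callable, and the output segment shape are all assumptions made for illustration. Only the `/v1/transcribe` path and the `{text, segments[{text, start, end}]}` response shape come from the description above.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class PrerecordedSTTProvider(ABC):
    """Hypothetical stand-in for the backend ABC; the real interface may differ."""

    @abstractmethod
    def transcribe(self, wav_bytes: bytes) -> List[Dict]:
        ...


# Assumed transport signature: (url, wav_bytes, bearer_token) -> parsed JSON dict.
# In production this would wrap an HTTP client; injecting it keeps the sketch testable.
Transport = Callable[[str, bytes, str], Dict]


class ParakeetPrerecordedProvider(PrerecordedSTTProvider):
    def __init__(self, base_url: str, token: str, transport: Transport):
        self.url = base_url.rstrip("/") + "/v1/transcribe"
        self.token = token
        self.transport = transport

    def transcribe(self, wav_bytes: bytes) -> List[Dict]:
        payload = self.transport(self.url, wav_bytes, self.token)
        segments = []
        for seg in payload.get("segments", []):
            text = seg.get("text", "").strip()
            if not text:
                continue  # drop empty segments
            segments.append({
                "text": text,
                "start": float(seg["start"]),
                "end": float(seg["end"]),
                "speaker": None,  # /v1/transcribe returns no speaker labels
            })
        return segments


def fake_transport(url: str, wav_bytes: bytes, token: str) -> Dict:
    """Canned response standing in for the Parakeet microservice."""
    return {"text": "hello world", "segments": [
        {"text": "hello", "start": 0.0, "end": 0.4},
        {"text": " ", "start": 0.4, "end": 0.5},
        {"text": "world", "start": 0.5, "end": 0.9},
    ]}


provider = ParakeetPrerecordedProvider("http://parakeet.local/", "secret", fake_transport)
print(provider.transcribe(b"RIFF"))
```

Keeping the HTTP call behind an injected callable means the provider can be unit-tested without a GPU service running.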

### Phase 2: NIM container deployment
- Switch from raw NeMo Dockerfile to NVIDIA NIM container (`nvcr.io/nim/nvidia/parakeet-1-1b-rnnt-multilingual`)
- TensorRT optimization for ~238x real-time throughput on L4
- Evaluate: deploy on existing GKE GPU node pool vs new dedicated instance
- Health checks, autoscaling, resource limits

### Phase 3: Model upgrade + diarization
- Evaluate Nemotron 3.5 ASR 0.6B (40 languages, native punctuation, 80ms streaming)
- Investigate native speaker diarization support — if absent, integrate NeMo MSDD or pyannote as sidecar
- Test CJK character handling (Deepgram splits with spaces — Parakeet behavior unknown)
- Test code-switching quality (language mixing within utterance)

### Phase 4: Gradual rollout
- A/B test via `STT_SERVICE_MODELS=parakeet,dg-nova-3` (Parakeet primary, DG fallback)
- Monitor: WER, first-segment latency (<10s target, <30s max), diarization accuracy
- Track DG usage decline → eventual deprecation of DG as primary
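
One way to read a comma-separated value such as `STT_SERVICE_MODELS=parakeet,dg-nova-3` is as a priority list: take the first service that supports the requested language and fall through to the next otherwise. The sketch below illustrates that interpretation only. The capability table, the language subset, and the function names are hypothetical, not the backend's actual routing code.

```python
from typing import Dict, List, Optional, Set

# Hypothetical capability table. None means "accepts any language" in this sketch.
SUPPORTED: Dict[str, Optional[Set[str]]] = {
    "parakeet": {"en", "de", "fr", "es"},  # illustrative subset of the 25 European languages
    "dg-nova-3": None,
}


def parse_models(raw: str) -> List[str]:
    """Split a comma-separated env value into an ordered list of service names."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def pick_service(raw: str, language: str) -> Optional[str]:
    """Return the first listed service that supports `language`, else None."""
    for name in parse_models(raw):
        if name not in SUPPORTED:
            continue  # unknown service names are skipped rather than raising
        languages = SUPPORTED[name]
        if languages is None or language in languages:
            return name
    return None


print(pick_service("parakeet,dg-nova-3", "en"))
print(pick_service("parakeet,dg-nova-3", "vi"))
```

With this ordering, supported languages go to Parakeet and everything else lands on Deepgram, so Deepgram usage should decline as Parakeet's language coverage grows.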

## Architecture reference

```
STTSocket (ABC)                      PrerecordedSTTProvider (ABC)
├─ SafeDeepgramSocket                ├─ DeepgramPrerecordedProvider
├─ SafeModulateSocket                ├─ ModulatePrerecordedProvider
└─ ParakeetStreamingSocket ✅        └─ ParakeetPrerecordedProvider ❌ (missing)

Streaming factory:                   Batch factory:
  get_stt_service_for_language()       get_prerecorded_service()
  → STT_SERVICE_MODELS env var         → STT_PRERECORDED_MODEL env var
  → routes: dg-nova-3, modulate        → routes: dg-nova-3, modulate
  → parakeet: opt-in only ⚠️           → parakeet: not supported ❌
```

## Environment variables (current)

| Var | Default | Purpose |
|-----|---------|---------|
| `HOSTED_PARAKEET_API_URL` | (unset) | Parakeet service URL |
| `PARAKEET_WINDOW_SECONDS` | 6.0 | Streaming buffer window |
| `PARAKEET_DIARIZATION` | 1 | Enable embedding-based diarization |
| `PARAKEET_MODEL` | nvidia/parakeet-tdt-0.6b-v3 | NeMo model name |
| `HOSTED_SPEAKER_EMBEDDING_API_URL` | (unset) | Required for Parakeet diarization |

## Open questions

1. Does Nemotron 3.5 ASR support speaker diarization natively? If not, what's the best sidecar (NeMo MSDD vs pyannote)?
2. NIM container vs raw NeMo — latency/throughput benchmarks on L4?
3. Where to host GPU workload — existing GKE T4 node pool ($1,204/mo already running) or new L4 instance?
4. Code-switching quality (e.g. Vietnamese-English mixing) — Parakeet is reportedly weak here vs Modulate
5. CJK tokenization — how does Parakeet handle word boundaries for Chinese/Japanese/Korean?

## Best model candidates (from geni research)

| Model | Languages | WER | Speed | Params | Notes |
|-------|-----------|-----|-------|--------|-------|
| **Nemotron 3.5 ASR 0.6B** | 40 (incl. vi, zh) | — | 80ms streaming | 0.6B | Recommended — native punctuation, CC-BY-4.0 |
| Parakeet-TDT-0.6b-v2 | English only | 6.05% | 3386x RT | 0.6B | Best English accuracy |
| Parakeet-TDT-0.6b-v3 | 25 European | 6.32% | — | 0.6B | Currently loaded in main |

## Related

- PR #7142 — Modulate Velma-2 provider (same ABC pattern)
- `backend/parakeet/` — existing microservice
- `backend/diarizer/` — existing pyannote speaker diarization service
---

## Status Update (2026-06-09)

### Solved ✅ (PR #7653)

| Feature | Status |
|---------|--------|
| Batch ASR (`/v1/transcribe`) | ✅ Prod — TDT 0.6b, 0.1% WER, full punctuation |
| Batch + diarization (`/v2/transcribe`) | ✅ Prod — built-in pyannote/wespeaker on GPU |
| Streaming ASR (`/v3/stream`) | ✅ Prod (opt-in) — RNNT 1.1b, 5s segments, AGC |
| Pre-recorded routing | ✅ Prod — `STT_PRERECORDED_MODEL=parakeet,dg-nova-3` with language fallback |
| Backend integration | ✅ `ParakeetPrerecordedProvider` + `ParakeetWebSocketSocket` |
| Language fallback | ✅ Parakeet for 25 EU langs, auto-fallback to Deepgram for CJK and Hindi |
| BF16 dual-model on L4 GPU | ✅ TDT + RNNT in 12.6Gi/20Gi |
| GPU concurrency safety | ✅ Semaphore serializes CUDA access |
| Helm chart + GitHub Actions workflow | ✅ `gcp_parakeet.yml` |
| E2E device test (Omi BLE → app UI) | ✅ Verified on Pixel 7a |

### Remaining

| Feature | Priority | Details |
|---------|----------|---------|
| Streaming as default | HIGH | Needs punctuation fix (RNNT outputs lowercase). Plan: interim RNNT → final TDT re-transcribe |
| Request batching | HIGH | Current: 1-at-a-time (semaphore). Target: `batch_size=4-8` for ~4x throughput |
| NeMo thread-safety fix | HIGH | [NeMo #15771](https://github.com/NVIDIA-NeMo/NeMo/issues/15771) — `model.transcribe()` freeze/unfreeze race confirmed as bug, not intentional. New OSS ASR worker patching upstream |
| Split batch/stream semaphores | MEDIUM | Batch and streaming currently block each other |
| NIM or Triton migration | MEDIUM | Raw NeMo + FastAPI is prototype-grade per NVIDIA. NIM handles batching, streaming, concurrency natively |
| Nemotron Speech 0.6B | LOW | Half memory vs RNNT 1.1b, designed for streaming, better concurrent support |
| TensorRT export | LOW | ~2x faster inference, ~50% less memory |
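
The "Split batch/stream semaphores" item could be approached roughly as follows. This is a sketch only: `GpuGate` and its parameters are hypothetical, it requires Python 3.9+ for `asyncio.to_thread`, and it assumes the batch (TDT) and streaming (RNNT) models can safely run on the GPU at the same time, which would need to be verified against the NeMo thread-safety issue listed above.

```python
import asyncio
import time
from typing import Any, Callable


class GpuGate:
    """One semaphore per workload, so batch and streaming no longer block each other.

    Construct inside a running event loop (for example within an async startup hook).
    """

    def __init__(self, batch_slots: int = 1, stream_slots: int = 1):
        self._gates = {
            "batch": asyncio.Semaphore(batch_slots),
            "stream": asyncio.Semaphore(stream_slots),
        }

    async def run(self, kind: str, infer: Callable[..., Any], *args: Any) -> Any:
        async with self._gates[kind]:
            # Run the blocking model call in a worker thread, off the event loop.
            return await asyncio.to_thread(infer, *args)


def fake_infer(label: str, seconds: float) -> str:
    """Stand-in for a blocking model call."""
    time.sleep(seconds)
    return label


async def demo() -> list:
    gate = GpuGate()
    # The two batch jobs run one after another; the stream job overlaps with them.
    return await asyncio.gather(
        gate.run("batch", fake_infer, "batch-1", 0.1),
        gate.run("batch", fake_infer, "batch-2", 0.1),
        gate.run("stream", fake_infer, "stream-1", 0.1),
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Requests of the same kind are still serialized, so this removes cross-workload blocking without changing per-model concurrency.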

### Prod Metrics (T+24h, stable)
- 7,005 requests served, 0% error rate (post-semaphore fix)
- v1=2.88s, v2=8.65s latency
- 0 pod restarts, 13.0Gi memory (no leak)
