| 1 | +# Sarvam AI - ADK Integration Capabilities |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +The Sarvam AI module integrates Sarvam AI services into the Google Agent Development Kit (ADK) for Java. It spans five service domains (Chat, Speech-to-Text, Text-to-Speech, Vision, and Live Connections) over both REST and WebSocket protocols, with retry-based resilience, token usage tracking, structured errors, and multi-turn tool-calling support. |
| 6 | + |
| 7 | +**Module path:** `contrib/sarvam-ai` |
| 8 | +**Package:** `com.google.adk.models.sarvamai` |
| 9 | +**Branch:** `sarvam-ai` |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## 1. Chat Completions (LLM) |
| 14 | + |
| 15 | +**Class:** `SarvamAi` extends `BaseLlm` |
| 16 | +**Endpoint:** `POST /v1/chat/completions` (OpenAI-compatible) |
| 17 | + |
| 18 | +| Capability | Details | |
| 19 | +|---|---| |
| 20 | +| Blocking (non-streaming) | Full request/response cycle via `generateContent(request, false)` | |
| 21 | +| SSE Streaming | Real-time token-by-token delivery via `generateContent(request, true)` with backpressure (RxJava `Flowable`) | |
| 22 | +| Function / Tool Calling | ADK `FunctionDeclaration` serialized to OpenAI `tools` JSON with `tool_choice: auto` | |
| 23 | +| Multi-turn Tool History | Prior `tool_calls` correctly formatted as assistant messages with `tool_call_id`, `function.name`, `function.arguments`; tool responses sent as `role: tool` | |
| 24 | +| Streaming Function Calls | Chunked `name` and `arguments` accumulated across SSE deltas, emitted as final `FunctionCall` Part | |
| 25 | +| Token Usage Tracking | `prompt_tokens`, `completion_tokens`, `total_tokens` extracted for both blocking and streaming modes. Streaming uses `stream_options: {"include_usage": true}` | |
| 26 | +| System Instructions | ADK `GenerateContentConfig.systemInstruction` mapped to OpenAI `system` role message | |
| 27 | +| Temperature Control | Forwarded from `GenerateContentConfig.temperature` (default 0.7) | |
| 28 | +| Max Output Tokens | `GenerateContentConfig.maxOutputTokens` forwarded as `max_tokens` | |
| 29 | +| Top-P Sampling | Configurable via `SarvamAiConfig.topP()` | |
| 30 | +| Frequency / Presence Penalty | Configurable via `SarvamAiConfig` builder | |
| 31 | +| Reasoning Effort | Sarvam-specific `reasoning_effort` parameter (low / medium / high) | |
| 32 | +| Wiki Grounding | Sarvam-specific `wiki_grounding` toggle for factual grounding | |
| 33 | +| Role Translation | ADK `model` -> OpenAI `assistant`, `user` -> `user`, `functionResponse` -> `tool` | |
| 34 | +| Schema Normalization | Type strings lowercased, nested `items.properties` recursively normalized for OpenAI schema compatibility | |
| 35 | +| Graceful Degradation | Empty choices return empty text response instead of crashing | |
| 36 | + |
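The "Streaming Function Calls" row can be pictured with a small accumulator. This is a minimal sketch rather than the module's actual code: `ToolCallAccumulator` and its method names are hypothetical, and it assumes each SSE delta carries optional `name` and `arguments` fragments that are simply concatenated in arrival order.

```java
// Hypothetical sketch: accumulates chunked tool-call fields across SSE deltas.
final class ToolCallAccumulator {
    private final StringBuilder name = new StringBuilder();
    private final StringBuilder arguments = new StringBuilder();

    /** Appends one delta; either fragment may be null when a chunk omits it. */
    void accept(String nameFragment, String argumentsFragment) {
        if (nameFragment != null) {
            name.append(nameFragment);
        }
        if (argumentsFragment != null) {
            arguments.append(argumentsFragment);
        }
    }

    /** True once at least part of a function name has been seen. */
    boolean hasCall() {
        return name.length() > 0;
    }

    String name() {
        return name.toString();
    }

    /** The concatenated arguments; only valid JSON after the final delta. */
    String argumentsJson() {
        return arguments.toString();
    }
}
```

Only after the stream ends would the accumulated name and arguments be turned into a single `FunctionCall` part, since intermediate argument strings are generally not parseable JSON.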
| 37 | +### Dual Implementation |
| 38 | + |
| 39 | +| Implementation | Location | Use Case | |
| 40 | +|---|---|---| |
| 41 | +| `SarvamBaseLM` | `core/src/main/java/.../models/SarvamBaseLM.java` | Lightweight, env-var driven. Used by `AgentModelConfig` and `LlmRegistry` for `Sarvam\|model` config strings | |
| 42 | +| `SarvamAi` | `contrib/sarvam-ai/src/.../SarvamAi.java` | Full-featured, Builder-pattern, OkHttp-based. Supports all chat parameters plus subservice access | |
| 43 | + |
| 44 | +--- |
| 45 | + |
| 46 | +## 2. Speech-to-Text (STT) |
| 47 | + |
| 48 | +**Class:** `SarvamSttService` implements `TranscriptionService` |
| 49 | +**Model:** `saaras:v3` |
| 50 | + |
| 51 | +| Capability | Details | |
| 52 | +|---|---| |
| 53 | +| REST Synchronous | `transcribe(byte[] audioData, TranscriptionConfig)` via `POST /speech-to-text` with multipart/form-data | |
| 54 | +| REST Async | `transcribeAsync()` executes on RxJava IO scheduler | |
| 55 | +| WebSocket Streaming | Real-time streaming via `wss://api.sarvam.ai/speech-to-text/streaming` with VAD (Voice Activity Detection) signals | |
| 56 | +| Transcription Modes | `transcribe`, `translate`, `verbatim`, `translit`, `codemix` | |
| 57 | +| Language Detection | Auto-detection supported; explicit BCP-47 codes (e.g., `hi-IN`, `en-IN`) also accepted | |
| 58 | +| VAD Signals | `speech_start` and `speech_end` events for voice activity boundaries | |
| 59 | +| ADK TranscriptionService | Full implementation of ADK's `TranscriptionService` interface including `isAvailable()`, `getServiceType()`, `getHealth()` | |
| 60 | + |
| 61 | +--- |
| 62 | + |
| 63 | +## 3. Text-to-Speech (TTS) |
| 64 | + |
| 65 | +**Class:** `SarvamTtsService` |
| 66 | +**Model:** `bulbul:v3` |
| 67 | + |
| 68 | +| Capability | Details | |
| 69 | +|---|---| |
| 70 | +| REST Synchronous | `synthesize(text, languageCode)` returns decoded WAV audio bytes | |
| 71 | +| REST Async | `synthesizeAsync()` on IO scheduler | |
| 72 | +| WebSocket Streaming | `synthesizeStream()` via `wss://api.sarvam.ai/text-to-speech/streaming` for low-latency progressive audio chunk delivery | |
| 73 | +| 30+ Speaker Voices | Configurable via `SarvamAiConfig.ttsSpeaker()` (default: `shubh`) | |
| 74 | +| Pace Control | Adjustable speech pace (0.5x to 2.0x) | |
| 75 | +| Sample Rate | Configurable output sample rate | |
| 76 | +| Base64 Decoding | Audio chunks automatically decoded from base64 to raw bytes | |
| 77 | +| WebSocket Lifecycle | Config frame -> text frame -> flush frame -> audio chunks -> final event -> close | |
| 78 | + |
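The "Base64 Decoding" row can be illustrated with the JDK alone. The helper below is a hypothetical sketch (`AudioChunkDecoder` is not a class in the module) and assumes each streamed chunk is an independently base64-encoded string whose decoded bytes are concatenated in arrival order.

```java
import java.io.ByteArrayOutputStream;
import java.util.Base64;
import java.util.List;

// Hypothetical helper: joins base64-encoded audio chunks into raw bytes.
final class AudioChunkDecoder {
    static byte[] decodeAll(List<String> base64Chunks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Base64.Decoder decoder = Base64.getDecoder();
        for (String chunk : base64Chunks) {
            byte[] raw = decoder.decode(chunk);
            // The (byte[], int, int) overload does not declare IOException.
            out.write(raw, 0, raw.length);
        }
        return out.toByteArray();
    }
}
```

A streaming consumer would more likely decode and forward each chunk as it arrives; collecting everything first is only done here to keep the sketch short.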
| 79 | +--- |
| 80 | + |
| 81 | +## 4. Vision / Document Intelligence |
| 82 | + |
| 83 | +**Class:** `SarvamVisionService` |
| 84 | +**Model:** Sarvam Vision 3B VLM |
| 85 | + |
| 86 | +| Capability | Details | |
| 87 | +|---|---| |
| 88 | +| Multi-Language OCR | 23 languages (22 Indian + English) | |
| 89 | +| Input Formats | PDF, PNG, JPG, ZIP | |
| 90 | +| Output Formats | HTML or Markdown | |
| 91 | +| Async Job Pipeline | `createJob` -> `uploadDocument` (presigned URL) -> `startJob` -> `getJobStatus` (poll) -> `downloadResults` | |
| 92 | +| Convenience Method | `processDocument(filePath, languageCode, outputFormat)` runs the full pipeline with adaptive exponential backoff polling | |
| 93 | +| Polling Backoff | Starts at 2s, doubles up to a 10s cap, max 60 polls (roughly 10 min of cumulative waiting in the worst case) | |
| 94 | + |
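The polling parameters in the table can be checked with a short calculation. The sketch below is an illustration under stated assumptions, not the service's code: `PollingSchedule` is a hypothetical name, and it assumes one wait per poll, starting at the initial delay and doubling until the cap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a capped, doubling polling schedule.
final class PollingSchedule {
    /** Returns the wait before each poll: initialMs, then doubling up to capMs. */
    static List<Long> delaysMillis(long initialMs, long capMs, int maxPolls) {
        List<Long> delays = new ArrayList<>();
        long delay = initialMs;
        for (int i = 0; i < maxPolls; i++) {
            delays.add(delay);
            delay = Math.min(delay * 2, capMs);
        }
        return delays;
    }

    /** Sums the schedule to obtain the worst-case cumulative wait. */
    static long totalMillis(List<Long> delays) {
        long total = 0;
        for (long d : delays) {
            total += d;
        }
        return total;
    }
}
```

With a 2s start, a 10s cap, and 60 polls, the schedule is 2s, 4s, 8s, then 10s for the remaining 57 polls, which sums to 584s. That figure is worth comparing against the timeout quoted in the table.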
| 95 | +--- |
| 96 | + |
| 97 | +## 5. Live Bidirectional Connection |
| 98 | + |
| 99 | +**Class:** `SarvamAiLlmConnection` implements `BaseLlmConnection` |
| 100 | + |
| 101 | +| Capability | Details | |
| 102 | +|---|---| |
| 103 | +| Multi-Turn Context | Maintains conversation history across turns, accumulates full model responses | |
| 104 | +| sendHistory | Replace full conversation context | |
| 105 | +| sendContent | Append a single turn and trigger streaming response | |
| 106 | +| receive | Returns `Flowable<LlmResponse>` via `PublishSubject` for reactive consumers | |
| 107 | +| Thread Safety | History list synchronized for concurrent access | |
| 108 | +| Realtime Guard | `sendRealtime(Blob)` throws `UnsupportedOperationException` with guidance to use STT/TTS services | |
| 109 | + |
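The difference between `sendHistory` (replace) and `sendContent` (append) can be sketched with a synchronized list. This is a simplified illustration: `ConversationHistory` is a hypothetical class, turns are plain strings rather than ADK `Content` objects, and the streaming side (`Flowable`/`PublishSubject`) is left out to keep the example dependency-free.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of replace-vs-append history semantics with simple locking.
final class ConversationHistory {
    private final List<String> turns = new ArrayList<>();

    /** Mirrors the sendHistory idea: discard existing context and install a new one. */
    synchronized void replaceAll(List<String> newTurns) {
        turns.clear();
        turns.addAll(newTurns);
    }

    /** Mirrors the sendContent idea: add one turn to the existing context. */
    synchronized void append(String turn) {
        turns.add(turn);
    }

    /** Returns a copy so callers can iterate without holding the lock. */
    synchronized List<String> snapshot() {
        return new ArrayList<>(turns);
    }
}
```

Returning a copy from `snapshot()` is a deliberate choice here: a request can be built from a stable view of the history while other threads keep appending.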
| 110 | +--- |
| 111 | + |
| 112 | +## 6. Resilience & Configuration |
| 113 | + |
| 114 | +### Retry with Exponential Backoff |
| 115 | + |
| 116 | +**Class:** `SarvamRetryInterceptor` (OkHttp `Interceptor`) |
| 117 | + |
| 118 | +| Parameter | Value | |
| 119 | +|---|---| |
| 120 | +| Retryable codes | 429 (rate limit) and all 5xx server errors, including 503 | |
| 121 | +| Base delay | 500ms | |
| 122 | +| Max delay | 30s | |
| 123 | +| Strategy | Exponential backoff with 20% jitter | |
| 124 | +| Default max retries | 3 | |
| 125 | + |
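The retry parameters above translate into a delay calculation along the following lines. This is a sketch under assumptions, not the interceptor's implementation: `BackoffPolicy` is a hypothetical name, attempts are 0-based, and the 20% jitter is taken to be symmetric (plus or minus 20% around the exponential delay), which the table does not specify.

```java
import java.util.Random;

// Hypothetical sketch of exponential backoff with jitter using the documented values.
final class BackoffPolicy {
    static final long BASE_DELAY_MS = 500;
    static final long MAX_DELAY_MS = 30_000;
    static final double JITTER = 0.20; // assumed symmetric: +/- 20%

    /** 429 and any 5xx status are treated as retryable. */
    static boolean isRetryable(int status) {
        return status == 429 || (status >= 500 && status <= 599);
    }

    /** Exponential delay without jitter: 500ms, 1s, 2s, ... capped at 30s. */
    static long baseDelayMs(int attempt) {
        // Clamp the shift so the multiplication cannot overflow a long.
        int shift = Math.max(0, Math.min(attempt, 16));
        return Math.min(BASE_DELAY_MS << shift, MAX_DELAY_MS);
    }

    /** Applies a random factor in [0.8, 1.2) and re-applies the cap. */
    static long jitteredDelayMs(int attempt, Random random) {
        long base = baseDelayMs(attempt);
        double factor = 1.0 + (random.nextDouble() * 2 - 1) * JITTER;
        return Math.min(Math.round(base * factor), MAX_DELAY_MS);
    }
}
```

Jitter spreads retries from many clients over time so that they do not all hit the API again at the same instant after a rate-limit response.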
| 126 | +### Immutable Configuration |
| 127 | + |
| 128 | +**Class:** `SarvamAiConfig` (Builder pattern) |
| 129 | + |
| 130 | +| Parameter | Default | |
| 131 | +|---|---| |
| 132 | +| Chat endpoint | `https://api.sarvam.ai/v1/chat/completions` | |
| 133 | +| STT endpoint | `https://api.sarvam.ai/speech-to-text` | |
| 134 | +| STT WebSocket | `wss://api.sarvam.ai/speech-to-text/streaming` | |
| 135 | +| TTS endpoint | `https://api.sarvam.ai/text-to-speech` | |
| 136 | +| TTS WebSocket | `wss://api.sarvam.ai/text-to-speech/streaming` | |
| 137 | +| Vision endpoint | `https://api.sarvam.ai/document-intelligence` | |
| 138 | +| Connect timeout | 30s | |
| 139 | +| Read timeout | 120s | |
| 140 | +| Max retries | 3 | |
| 141 | +| API key resolution | An explicit Builder value takes precedence over the `SARVAM_API_KEY` env var | |
| 142 | + |
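The API key precedence can be sketched as a small resolver. `ApiKeyResolver` is a hypothetical helper, not part of `SarvamAiConfig`; the environment lookup is passed in as a function so the logic can be exercised without touching the real environment, and throwing when no key is found is an assumption made for this sketch.

```java
import java.util.function.Function;

// Hypothetical sketch: explicit key first, then the SARVAM_API_KEY environment variable.
final class ApiKeyResolver {
    static final String ENV_VAR = "SARVAM_API_KEY";

    static String resolve(String explicitKey, Function<String, String> env) {
        if (explicitKey != null && !explicitKey.isBlank()) {
            return explicitKey;
        }
        String fromEnv = env.apply(ENV_VAR);
        if (fromEnv != null && !fromEnv.isBlank()) {
            return fromEnv;
        }
        // Assumed behavior for this sketch; the real config may handle this differently.
        throw new IllegalStateException(
            "No Sarvam API key: pass one explicitly or set " + ENV_VAR);
    }
}
```

In application code the lookup would typically be `ApiKeyResolver.resolve(explicitKey, System::getenv)`.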
| 143 | +### Structured Error Handling |
| 144 | + |
| 145 | +**Class:** `SarvamAiException` extends `RuntimeException` |
| 146 | + |
| 147 | +| Field | Purpose | |
| 148 | +|---|---| |
| 149 | +| `statusCode` | HTTP status code from API | |
| 150 | +| `errorCode` | Sarvam-specific error code | |
| 151 | +| `requestId` | Sarvam request ID for support tracing | |
| 152 | +| `isRetryable()` | Programmatic check: true for 429 and any 5xx status (including 503) | |
| 153 | + |
| 154 | +--- |
| 155 | + |
| 156 | +## 7. Authentication |
| 157 | + |
| 158 | +| Method | Header | Used By | |
| 159 | +|---|---|---| |
| 160 | +| API Subscription Key | `api-subscription-key: <key>` | `SarvamAi`, STT, TTS, Vision (contrib module) | |
| 161 | +| Bearer Token | `Authorization: Bearer <key>` | `SarvamBaseLM` (core module, OpenAI-compatible) | |
| 162 | +| Key Resolution | `SARVAM_API_KEY` env var or explicit via Builder | Both | |
| 163 | +| Early Key Check | Warning logged at construction if the key is missing | `SarvamBaseLM` | |
| 164 | + |
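The two header styles in the table look like this on an actual request. The contrib module is described as OkHttp-based; the sketch below swaps in the JDK's `java.net.http` API purely to stay dependency-free, and `SarvamAuthHeaders` is a hypothetical helper. It only builds the request and sends nothing.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch: builds a chat request with one of the two documented auth headers.
final class SarvamAuthHeaders {
    static final String CHAT_URL = "https://api.sarvam.ai/v1/chat/completions";

    static HttpRequest chatRequest(String apiKey, String jsonBody, boolean useBearer) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(CHAT_URL))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(jsonBody));
        if (useBearer) {
            // Style listed for SarvamBaseLM (core module).
            builder.header("Authorization", "Bearer " + apiKey);
        } else {
            // Style listed for the contrib module services.
            builder.header("api-subscription-key", apiKey);
        }
        return builder.build();
    }
}
```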
| 165 | +--- |
| 166 | + |
| 167 | +## 8. Test Coverage |
| 168 | + |
| 169 | +| Test Class | Tests | Scope | |
| 170 | +|---|---|---| |
| 171 | +| `SarvamBaseLMTest` | 10 | Response parsing (text, null, tool calls), construction, connection type | |
| 172 | +| `SarvamAiTest` | - | Chat completion blocking and streaming | |
| 173 | +| `SarvamAiConfigTest` | - | Config builder validation, defaults, env var resolution | |
| 174 | +| `ChatRequestTest` | - | Request serialization from LlmRequest | |
| 175 | +| `SarvamSttServiceTest` | - | STT REST and WebSocket transcription | |
| 176 | +| `SarvamTtsServiceTest` | - | TTS REST and WebSocket synthesis | |
| 177 | +| `SarvamRetryInterceptorTest` | - | Retry logic, delay calculation, jitter | |
| 178 | +| `SarvamIntegrationTest` (rae) | 20 | End-to-end config wiring across properties, YAML, LlmRegistry | |
| 179 | + |
| 180 | +--- |
| 181 | + |
| 182 | +## 9. RAE Integration (Consumer Project) |
| 183 | + |
| 184 | +| Integration Point | Mechanism | File | |
| 185 | +|---|---|---| |
| 186 | +| Code-based agents | `AgentModelConfig` recognizes `Sarvam\|` prefix, instantiates `SarvamBaseLM` | `AgentModelConfig.java` | |
| 187 | +| YAML-based agents | `LlmRegistry.registerLlm("Sarvam\\|.*", ...)` factory | `ApplicationRegistry.java` | |
| 188 | +| Model metadata | `sarvam:` provider in `models.yaml` with feature declarations | `models.yaml` | |
| 189 | +| Config format | `Sarvam\|sarvam-m` -- single string works across both paths | `agent-models.properties` + `*.yaml` | |
| 190 | +| Global coverage | 43 code-based + 28 YAML agent configs switched to Sarvam | All agent config files | |
| 191 | + |
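The `Sarvam|model` config string can be recognized with the same regular expression shown in the registry row. The sketch below is illustrative only: `SarvamModelString` is a hypothetical helper, and treating everything after the first `|` as the model name is an assumption about how the prefix is split.

```java
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical sketch: extracts the model name from a "Sarvam|<model>" config string.
final class SarvamModelString {
    private static final String PREFIX = "Sarvam|";
    // Same pattern as the registry row; the pipe is escaped because it is a regex operator.
    private static final Pattern SARVAM = Pattern.compile("Sarvam\\|.*");

    static Optional<String> modelName(String config) {
        if (config == null || !SARVAM.matcher(config).matches()) {
            return Optional.empty();
        }
        return Optional.of(config.substring(PREFIX.length()));
    }
}
```

For example, `SarvamModelString.modelName("Sarvam|sarvam-m")` yields `sarvam-m`, while a string without the prefix yields an empty `Optional` so other providers can claim it.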
| 192 | +--- |
| 193 | + |
| 194 | +## Architecture Summary |
| 195 | + |
| 196 | +``` |
| 197 | +contrib/sarvam-ai/ |
| 198 | + src/main/java/com/google/adk/models/sarvamai/ |
| 199 | + SarvamAi.java # BaseLlm (chat, Builder pattern, OkHttp) |
| 200 | + SarvamAiConfig.java # Immutable config for all services |
| 201 | + SarvamAiException.java # Structured error with status/code/requestId |
| 202 | + SarvamAiLlmConnection.java # Live bidirectional multi-turn connection |
| 203 | + SarvamRetryInterceptor.java # Exponential backoff with jitter |
| 204 | + chat/ |
| 205 | + ChatRequest.java # OpenAI-compatible request model |
| 206 | + ChatResponse.java # Response deserialization |
| 207 | + ChatChoice.java # Choice wrapper |
| 208 | + ChatMessage.java # Message model |
| 209 | + ChatUsage.java # Token usage tracking |
| 210 | + stt/ |
| 211 | + SarvamSttService.java # REST + WebSocket STT (TranscriptionService) |
| 212 | + tts/ |
| 213 | + SarvamTtsService.java # REST + WebSocket TTS |
| 214 | + TtsRequest.java # TTS request model |
| 215 | + TtsResponse.java # TTS response model |
| 216 | + vision/ |
| 217 | + SarvamVisionService.java # Async job pipeline for document OCR |
| 218 | + |
| 219 | +core/src/main/java/com/google/adk/models/ |
| 220 | + SarvamBaseLM.java # Lightweight BaseLlm for agent config integration |
| 221 | +``` |