ADK-Java provides an optional audio transcription capability that allows agents to transcribe audio data to text. The feature is designed to be optional, lazily loaded, and configurable via environment variables.
- Optional Feature: The application runs normally without any transcription configuration; the feature turns on once an endpoint is configured
- Lazy Loading: Services created only when needed, cached for reuse
- Multiple Service Support: Extensible architecture supporting multiple transcription services
- Async Processing: Built on RxJava for efficient asynchronous operations
- Streaming Support: Supports both batch and streaming transcription
- Environment Configuration: All configuration via environment variables (12-Factor App compliant)
The transcription capability follows several design patterns:
- Strategy Pattern: Pluggable transcription service implementations
- Factory Pattern: Lazy-loaded service creation with caching
- Builder Pattern: Flexible configuration management
- Optional Pattern: Graceful degradation when not configured
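As a rough illustration of the Optional pattern listed above, the sketch below reads a few of the documented environment variables and returns an empty Optional when no endpoint is set. The class name EnvConfigSketch, its nested Config holder, and the choice to fall back to the default on a malformed number are assumptions made for this example; they are not the actual TranscriptionConfigLoader or TranscriptionConfig.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of optional, environment-driven configuration. */
public final class EnvConfigSketch {

  /** Immutable holder for the settings read from the environment. */
  public static final class Config {
    public final String endpoint;
    public final int timeoutSeconds;
    public final int maxRetries;
    public final int chunkSizeMs;

    Config(String endpoint, int timeoutSeconds, int maxRetries, int chunkSizeMs) {
      this.endpoint = endpoint;
      this.timeoutSeconds = timeoutSeconds;
      this.maxRetries = maxRetries;
      this.chunkSizeMs = chunkSizeMs;
    }
  }

  /** Returns an empty Optional when the endpoint variable is missing or blank. */
  public static Optional<Config> load(Map<String, String> env) {
    String endpoint = env.get("ADK_TRANSCRIPTION_ENDPOINT");
    if (endpoint == null || endpoint.trim().isEmpty()) {
      return Optional.empty();
    }
    return Optional.of(new Config(
        endpoint.trim(),
        intOrDefault(env, "ADK_TRANSCRIPTION_TIMEOUT_SECONDS", 30),
        intOrDefault(env, "ADK_TRANSCRIPTION_MAX_RETRIES", 3),
        intOrDefault(env, "ADK_TRANSCRIPTION_CHUNK_SIZE_MS", 500)));
  }

  /** Assumption: a missing or malformed number falls back to the documented default. */
  private static int intOrDefault(Map<String, String> env, String key, int fallback) {
    String raw = env.get(key);
    if (raw == null || raw.trim().isEmpty()) {
      return fallback;
    }
    try {
      return Integer.parseInt(raw.trim());
    } catch (NumberFormatException e) {
      return fallback;
    }
  }

  public static void main(String[] args) {
    Optional<Config> config = load(System.getenv());
    System.out.println(config.isPresent()
        ? "Transcription configured: " + config.get().endpoint
        : "Transcription not configured");
  }
}
```

Passing the environment in as a Map keeps the loader easy to test, and callers can branch on the Optional instead of checking for null.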
com.google.adk.transcription/
├── ServiceType.java                      # Service type enumeration
├── AudioFormat.java                      # Audio format specifications
├── TranscriptionException.java           # Custom exception
├── ServiceHealth.java                    # Health status DTO
├── TranscriptionResult.java              # Result DTO
├── TranscriptionEvent.java               # Event DTO for streaming
├── TranscriptionService.java             # Core interface
├── TranscriptionConfig.java              # Configuration class
├── config/
│   └── TranscriptionConfigLoader.java    # Environment config loader
├── client/
│   ├── WhisperRequest.java               # Request DTO
│   ├── WhisperResponse.java              # Response DTO
│   └── WhisperApiClient.java             # HTTP client
├── strategy/
│   ├── WhisperTranscriptionService.java  # Service implementation
│   └── TranscriptionServiceFactory.java  # Factory
└── processor/
    └── AudioChunkAggregator.java         # Chunk aggregation
Required (for transcription to work):
ADK_TRANSCRIPTION_ENDPOINT=https://your-transcription-service:port

Optional:
# Service type (default: inferred from endpoint)
ADK_TRANSCRIPTION_SERVICE_TYPE=whisper
# API key if required by service
ADK_TRANSCRIPTION_API_KEY=your-api-key
# Language code (default: auto-detect)
ADK_TRANSCRIPTION_LANGUAGE=en
# Timeout in seconds (default: 30)
ADK_TRANSCRIPTION_TIMEOUT_SECONDS=30
# Max retries (default: 3)
ADK_TRANSCRIPTION_MAX_RETRIES=3
# Chunk size in milliseconds for streaming (default: 500)
ADK_TRANSCRIPTION_CHUNK_SIZE_MS=500

The simplest way to use transcription is through the TranscriptionTool, which can be added to any agent:
import com.google.adk.agents.LlmAgent;
import com.google.adk.tools.transcription.TranscriptionTool;
import com.google.adk.tools.FunctionTool;
// Create transcription tool (returns null if not configured)
FunctionTool transcriptionTool = TranscriptionTool.create();
if (transcriptionTool != null) {
  LlmAgent agent = LlmAgent.builder()
      .name("audio_agent")
      .model("gemini-2.0-flash")
      .instruction("Analyze audio files. Use transcribe_audio tool when needed.")
      .addTool(transcriptionTool)
      .build();
  // Agent can now automatically call transcribe_audio tool
}

To check whether transcription is configured before adding the tool:

if (TranscriptionTool.isAvailable()) {
  // Transcription is configured and available
  FunctionTool tool = TranscriptionTool.create();
  agent.addTool(tool);
} else {
  // Work without transcription
  System.out.println("Transcription not configured");
}

For more control, you can use the transcription service directly:
import com.google.adk.transcription.*;
import com.google.adk.transcription.config.TranscriptionConfigLoader;
import com.google.adk.transcription.strategy.TranscriptionServiceFactory;
import java.util.Optional;

// Load configuration from environment
Optional<TranscriptionConfig> config = TranscriptionConfigLoader.loadFromEnvironment();
if (config.isPresent()) {
  // Get service (lazy loaded, cached)
  TranscriptionService service = TranscriptionServiceFactory.getOrCreate(config.get());
  // Synchronous transcription
  byte[] audioData = ...; // Your audio bytes
  TranscriptionResult result = service.transcribe(audioData, config.get());
  System.out.println("Transcribed: " + result.getText());
}

import io.reactivex.rxjava3.core.Single; // RxJava 3 package assumed

// Use RxJava Single for async transcription
Single<TranscriptionResult> resultFuture =
    service.transcribeAsync(audioData, config.get());
resultFuture.subscribe(
    result -> System.out.println("Transcribed: " + result.getText()),
    error -> System.err.println("Error: " + error.getMessage())
);

import io.reactivex.rxjava3.core.Flowable; // RxJava 3 package assumed

// Stream audio chunks and get transcription events
Flowable<byte[]> audioStream = ...; // Your audio stream
Flowable<TranscriptionEvent> transcriptionEvents =
    service.transcribeStream(audioStream, config.get());
transcriptionEvents.subscribe(
    event -> {
      if (event.isFinished()) {
        System.out.println("Final: " + event.getText());
      } else {
        System.out.println("Partial: " + event.getText());
      }
    },
    // Handle stream errors explicitly instead of leaving them unhandled
    error -> System.err.println("Error: " + error.getMessage())
);

When used as an agent tool, transcription exposes the following function:
Function Name: transcribe_audio
Parameters:
- audio_data (required): Base64-encoded audio data
- language (optional): Language code (e.g., "en", "es", "fr")
Returns:
{
"text": "Transcribed text",
"language": "en",
"confidence": 0.95,
"duration": 5000
}

Currently implemented:
- Whisper: HTTP-based Whisper API integration
Future support planned:
- Gemini Live API
- Azure Speech Services
- AWS Transcribe
Transcription operations throw TranscriptionException for errors. The service includes:
- Retry logic with exponential backoff
- Health check support
- Comprehensive error messages
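The retry behaviour mentioned above can be pictured with the following minimal sketch of retry with exponential backoff. It is not the library's implementation: the class RetrySketch, the doubling schedule, and the 10-second cap are all assumptions chosen for illustration.

```java
import java.util.function.Supplier;

/** Hypothetical sketch of retry with exponential backoff. */
public final class RetrySketch {

  /** Delay before retry number attempt (0-based): base * 2^attempt, capped at maxMillis. */
  public static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
    // Clamp the shift so large attempt numbers cannot overflow for typical bases.
    long delay = baseMillis << Math.min(attempt, 20);
    return Math.min(delay, maxMillis);
  }

  /** Runs the call once, then retries up to maxRetries times on RuntimeException. */
  public static <T> T callWithRetry(Supplier<T> call, int maxRetries, long baseMillis) {
    if (maxRetries < 0) {
      throw new IllegalArgumentException("maxRetries must be >= 0");
    }
    RuntimeException last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return call.get();
      } catch (RuntimeException e) {
        last = e; // remember the failure and fall through to the backoff
      }
      if (attempt < maxRetries) {
        sleep(backoffMillis(attempt, baseMillis, 10_000L));
      }
    }
    throw last; // every attempt failed
  }

  private static void sleep(long millis) {
    try {
      Thread.sleep(millis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting to retry", e);
    }
  }
}
```

With a base delay of 100 ms, the waits between attempts would be 100 ms, 200 ms, 400 ms, and so on until the cap is reached.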
Thread safety:
- Service factory uses thread-safe caching
- Services are stateless and thread-safe
- Configuration objects are immutable
# Compile the project
mvn compile -DskipTests

# Run only TranscriptionConfigTest
mvn test -Dtest=TranscriptionConfigTest

The TranscriptionServiceFactory implements lazy loading:
- Services are created only when first accessed
- Services are cached and reused
- Thread-safe implementation using ConcurrentHashMap
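The lazy, cached creation described above could look roughly like the sketch below, which relies on ConcurrentHashMap.computeIfAbsent so that each key gets at most one service instance. The names LazyFactorySketch, Service, and SimpleService, and the use of the endpoint string as the cache key, are assumptions for this example and do not describe the real TranscriptionServiceFactory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a lazy, caching, thread-safe service factory. */
public final class LazyFactorySketch {

  /** Stand-in for a transcription service. */
  public interface Service {
    String endpoint();
  }

  private static final class SimpleService implements Service {
    private final String endpoint;

    SimpleService(String endpoint) {
      this.endpoint = endpoint;
    }

    @Override
    public String endpoint() {
      return endpoint;
    }
  }

  // Cache keyed by endpoint (an assumption; a real factory may key on more fields).
  private static final Map<String, Service> CACHE = new ConcurrentHashMap<>();

  // Counts constructions so the caching behaviour can be observed.
  static final AtomicInteger CREATED = new AtomicInteger();

  /** Creates the service on first access and returns the cached instance afterwards. */
  public static Service getOrCreate(String endpoint) {
    return CACHE.computeIfAbsent(endpoint, key -> {
      CREATED.incrementAndGet();
      return new SimpleService(key);
    });
  }
}
```

computeIfAbsent runs the mapping function atomically per key, so concurrent callers asking for the same endpoint share a single instance without explicit locking.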
The Whisper implementation uses OkHttp for HTTP requests:
- Connection pooling
- Configurable timeouts
- Retry logic with exponential backoff
- Health check support
Audio handling:
- Supports multiple audio formats (PCM, WAV, MP3)
- Configurable sample rates and channels
- Chunk aggregation for efficient streaming
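To make the chunk aggregation idea concrete, here is a minimal sketch that buffers incoming bytes and emits fixed-duration chunks. It assumes uncompressed PCM audio, where the chunk size in bytes follows directly from sample rate, channel count, and bytes per sample. The class ChunkAggregatorSketch and its methods are hypothetical and are not the actual AudioChunkAggregator.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of aggregating raw PCM bytes into fixed-duration chunks. */
public final class ChunkAggregatorSketch {
  private final int chunkBytes;
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

  public ChunkAggregatorSketch(int sampleRateHz, int channels, int bytesPerSample, int chunkSizeMs) {
    this.chunkBytes = bytesPerChunk(sampleRateHz, channels, bytesPerSample, chunkSizeMs);
    if (chunkBytes <= 0) {
      throw new IllegalArgumentException("chunk size must be positive");
    }
  }

  /** Bytes of PCM audio covering chunkSizeMs milliseconds. */
  public static int bytesPerChunk(int sampleRateHz, int channels, int bytesPerSample, int chunkSizeMs) {
    return (int) ((long) sampleRateHz * channels * bytesPerSample * chunkSizeMs / 1000);
  }

  /** Buffers the data and returns every complete chunk that is now available. */
  public List<byte[]> add(byte[] data) {
    buffer.write(data, 0, data.length);
    byte[] all = buffer.toByteArray();
    List<byte[]> ready = new ArrayList<>();
    int offset = 0;
    while (all.length - offset >= chunkBytes) {
      ready.add(Arrays.copyOfRange(all, offset, offset + chunkBytes));
      offset += chunkBytes;
    }
    // Keep only the incomplete remainder for the next call.
    buffer.reset();
    buffer.write(all, offset, all.length - offset);
    return ready;
  }

  /** Returns whatever is left in the buffer, for example at end of stream. */
  public byte[] flush() {
    byte[] rest = buffer.toByteArray();
    buffer.reset();
    return rest;
  }
}
```

For 16 kHz mono 16-bit PCM and the default 500 ms chunk size, each chunk holds 16000 bytes.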
Known limitations:
- Live streaming integration (real-time audio) is not yet implemented
- PostgreSQL storage integration is not yet implemented
- Additional service implementations (Gemini, Azure, AWS) are planned
Planned enhancements:
- Real-time streaming handler integration
- Persistent storage for audio and metadata
- Additional transcription service implementations
- Enhanced error handling and retry strategies
- Performance optimizations and caching
Copyright 2025 Google LLC
Licensed under the Apache License, Version 2.0.