Problem
As voice and audio models become more prevalent, the system needs to support audio-specific token types and billing. The current system tracks only text tokens, so it misses audio input/output tokens, which are priced differently.
Current State
- No audio_tokens field in Usage model
- No extraction for audio-specific usage data
- No cost fields for audio token rates
- Audio transcription/speech tracked as regular tokens (if at all)
Future Audio Models to Support
OpenAI Audio
- Whisper (transcription): Charges per minute of audio
- TTS (text-to-speech): Charges per character generated
- GPT-4o Audio (future): Native audio in/out with specific token rates
Other Providers
- ElevenLabs: Per character or per minute pricing
- Anthropic Claude Audio (future): Expected audio token support
- Google Gemini Audio: Already supports audio with different rates
Technical Requirements
1. Update Usage Model
// ConduitLLM.Core/Models/Usage.cs
/// <summary>
/// Number of audio input tokens (for models processing audio).
/// </summary>
[JsonPropertyName("audio_input_tokens")]
[JsonIgnore(Condition = JsonIgnoreCondition.WhenWritingNull)]
public int? AudioInputTokens { get; set; }
/// <summary>
/// Number of audio output tokens (for models generating audio).
/// </summary>
[JsonPropertyName("audio_output_tokens")]
[JsonIgnore(Condition = JsonIgnoreCondition.WhenWritingNull)]
public int? AudioOutputTokens { get; set; }
/// <summary>
/// Duration of audio processed/generated in seconds.
/// </summary>
[JsonPropertyName("audio_duration_seconds")]
[JsonIgnore(Condition = JsonIgnoreCondition.WhenWritingNull)]
public double? AudioDurationSeconds { get; set; }
/// <summary>
/// Number of characters for TTS generation.
/// </summary>
[JsonPropertyName("tts_characters")]
[JsonIgnore(Condition = JsonIgnoreCondition.WhenWritingNull)]
public int? TtsCharacters { get; set; }
2. Update UsageExtractor
// Handle OpenAI Whisper/TTS format
if (usageElement.TryGetProperty("audio_seconds", out var audioSeconds))
    usage.AudioDurationSeconds = audioSeconds.GetDouble();

// Handle audio token formats
if (usageElement.TryGetProperty("audio_input_tokens", out var audioInput))
    usage.AudioInputTokens = audioInput.GetInt32();
if (usageElement.TryGetProperty("audio_output_tokens", out var audioOutput))
    usage.AudioOutputTokens = audioOutput.GetInt32();
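One thing to watch in the extraction above: GetInt32() and GetDouble() throw when a property is present but holds null or a string, which some providers may send. The following is a minimal, more defensive sketch. AudioUsage and AudioUsageExtractor are hypothetical stand-ins for illustration, not the existing Usage or UsageExtractor types, and the property names are the ones proposed in this issue rather than a confirmed provider format.

```csharp
using System;
using System.Text.Json;

// Hypothetical stand-in for the audio fields proposed for the Usage model.
public sealed class AudioUsage
{
    public int? AudioInputTokens { get; set; }
    public int? AudioOutputTokens { get; set; }
    public double? AudioDurationSeconds { get; set; }
}

// Hypothetical helper; the real UsageExtractor may be structured differently.
public static class AudioUsageExtractor
{
    // Reads the audio fields from a "usage" JSON object, leaving a field null
    // when the property is absent, null, or not a JSON number.
    public static AudioUsage Extract(JsonElement usageElement)
    {
        var usage = new AudioUsage();

        if (TryGetNumber(usageElement, "audio_seconds", out var seconds))
            usage.AudioDurationSeconds = seconds.GetDouble();

        if (TryGetNumber(usageElement, "audio_input_tokens", out var input)
            && input.TryGetInt32(out var inputTokens))
            usage.AudioInputTokens = inputTokens;

        if (TryGetNumber(usageElement, "audio_output_tokens", out var output)
            && output.TryGetInt32(out var outputTokens))
            usage.AudioOutputTokens = outputTokens;

        return usage;
    }

    // True only when the parent is an object and the named property is a number.
    private static bool TryGetNumber(JsonElement parent, string name, out JsonElement value)
    {
        value = default;
        return parent.ValueKind == JsonValueKind.Object
            && parent.TryGetProperty(name, out value)
            && value.ValueKind == JsonValueKind.Number;
    }
}
```

With this shape, a payload such as {"audio_input_tokens": 120, "audio_output_tokens": null} yields AudioInputTokens = 120 and leaves AudioOutputTokens unset instead of throwing.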
3. Update ModelCost Entity
/// <summary>
/// Cost per million audio input tokens.
/// </summary>
[Column(TypeName = "decimal(18, 10)")]
public decimal? AudioInputCostPerMillionTokens { get; set; }
/// <summary>
/// Cost per million audio output tokens.
/// </summary>
[Column(TypeName = "decimal(18, 10)")]
public decimal? AudioOutputCostPerMillionTokens { get; set; }
/// <summary>
/// Cost per minute of audio (Whisper-style pricing).
/// </summary>
[Column(TypeName = "decimal(18, 10)")]
public decimal? AudioCostPerMinute { get; set; }
/// <summary>
/// Cost per 1000 characters (TTS-style pricing).
/// </summary>
[Column(TypeName = "decimal(18, 10)")]
public decimal? TtsCostPerThousandCharacters { get; set; }
4. Update Cost Calculation
Handle different audio pricing models:
- Per token (GPT-4o audio)
- Per minute (Whisper)
- Per character (TTS)
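The three pricing models above could be dispatched roughly as follows. This is a sketch under stated assumptions, not the existing cost calculation code: AudioPricingModel, AudioUsage, AudioRates, and AudioCostCalculator are hypothetical names that mirror the fields proposed in this issue, and treating a missing rate as zero is an illustrative choice.

```csharp
using System;

// Hypothetical enum mirroring the audio values proposed for PricingModel.
public enum AudioPricingModel { AudioPerMinute, AudioPerCharacter, AudioTokenBased }

// Hypothetical containers mirroring the proposed Usage and ModelCost fields.
public sealed record AudioUsage(
    int? AudioInputTokens = null,
    int? AudioOutputTokens = null,
    double? AudioDurationSeconds = null,
    int? TtsCharacters = null);

public sealed record AudioRates(
    decimal? AudioInputCostPerMillionTokens = null,
    decimal? AudioOutputCostPerMillionTokens = null,
    decimal? AudioCostPerMinute = null,
    decimal? TtsCostPerThousandCharacters = null);

public static class AudioCostCalculator
{
    // Returns the audio cost for one request. Missing usage or rate values
    // are treated as zero here (an assumption made for this sketch).
    public static decimal Calculate(AudioPricingModel model, AudioUsage usage, AudioRates rates) =>
        model switch
        {
            // Whisper-style: seconds are converted to minutes before applying the rate.
            AudioPricingModel.AudioPerMinute =>
                (decimal)(usage.AudioDurationSeconds ?? 0) / 60m
                    * (rates.AudioCostPerMinute ?? 0m),

            // TTS-style: the rate is per 1,000 characters.
            AudioPricingModel.AudioPerCharacter =>
                (usage.TtsCharacters ?? 0) / 1_000m
                    * (rates.TtsCostPerThousandCharacters ?? 0m),

            // Token-based: input and output tokens are priced separately per million.
            AudioPricingModel.AudioTokenBased =>
                (usage.AudioInputTokens ?? 0) / 1_000_000m
                    * (rates.AudioInputCostPerMillionTokens ?? 0m)
                + (usage.AudioOutputTokens ?? 0) / 1_000_000m
                    * (rates.AudioOutputCostPerMillionTokens ?? 0m),

            _ => throw new ArgumentOutOfRangeException(nameof(model)),
        };
}
```

Using decimal throughout avoids binary floating-point rounding in billing amounts. A production version might prefer to log or reject a request whose rate is missing rather than silently charging zero.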
5. Update PricingModel Enum
public enum PricingModel
{
    Standard = 1,
    // ... existing ...
    AudioPerMinute = 10,
    AudioPerCharacter = 11,
    AudioTokenBased = 12
}
Example Pricing
OpenAI Whisper
- $0.006 per minute of audio
OpenAI TTS
- TTS: $15 per 1M characters
- TTS HD: $30 per 1M characters
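Note that the TTS prices here are quoted per 1M characters, while the proposed TtsCostPerThousandCharacters field stores a rate per 1,000 characters, so a conversion is needed when seeding the cost table. A small sketch of that arithmetic follows; ExamplePricing is a hypothetical helper, and the constants are simply the figures listed in this section.

```csharp
// Hypothetical helper holding the example prices from this issue.
public static class ExamplePricing
{
    public const decimal WhisperPerMinute = 0.006m;
    public const decimal TtsPerMillionCharacters = 15m;
    public const decimal TtsHdPerMillionCharacters = 30m;

    // Converts a price quoted per 1M characters into the per-1K-characters
    // unit used by the proposed TtsCostPerThousandCharacters field.
    public static decimal ToPerThousandCharacters(decimal perMillionCharacters) =>
        perMillionCharacters / 1_000m;
}
```

Under these figures, TTS works out to $0.015 per 1,000 characters and TTS HD to $0.030 per 1,000 characters, and a 10-minute Whisper transcription costs 10 × $0.006 = $0.06.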
Future GPT-4o Audio (speculative)
- Audio input: Different rate than text input
- Audio output: Different rate than text output
Impact
- Future Revenue: Will miss audio billing when these models are added
- Affected Models: Whisper, TTS, future multimodal models with audio
- Severity: Low (future need, not current)
Testing Requirements
- Unit tests for audio token extraction
- Cost calculation tests for different audio pricing models
- Integration tests with mock audio API responses
- Refund calculation tests including audio tokens
Priority
Low. This is future-proofing for when audio models are added to the system; it is not causing any current revenue loss.