## What problem does this solve?
RubyLLM provides a beautiful, unified interface for LLM capabilities — `chat`, `paint`, `embed`, `transcribe`. But audio is only half the story: we can turn speech into text (`transcribe`), but not text into speech.
Developers building voice-enabled apps, accessibility features, or content pipelines currently have to drop out of RubyLLM's DSL to wire up TTS manually — choosing an HTTP client, handling binary responses, managing provider auth separately. This breaks the "one gem, consistent interface" experience that makes RubyLLM great.
This issue proposes two related features:

1. `RubyLLM.speak` — core TTS API (primary focus)
2. SSML Builder DSL — Ruby DSL for building SSML documents (future phase, inspired by `RubyLLM::Schema`)
## Proposed solution

### API
Simple usage:

```ruby
speech = RubyLLM.speak("Hello, world!")
speech.save("hello.mp3")
```

With options:

```ruby
speech = RubyLLM.speak("Hello, world!", model: "tts-1-hd", format: "wav")
speech.save("hello.wav")
```

Per-call context:

```ruby
context = RubyLLM.context { |c| c.openai_api_key = "sk-..." }
speech = context.speak("Hello!")
```
### Response Object: `RubyLLM::Speech`

Following the pattern of `Transcription`, `Image`, `Embedding`:

```ruby
class RubyLLM::Speech
  attr_reader :data    # Audio bytes (binary string)
  attr_reader :model   # Model ID used
  attr_reader :format  # Output format (mp3, wav, aac, flac, pcm)

  def save(path)  # Write audio to file
  def to_blob     # Raw binary data
  def mime_type   # e.g. "audio/mpeg"
end
```
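As a sketch of what that interface might look like once filled in (the format-to-MIME mapping and the `initialize` signature here are assumptions for illustration, not part of the proposal):

```ruby
module RubyLLM
  class Speech
    # Assumed mapping for illustration; "audio/pcm" in particular is
    # not an IANA-registered type and would need a deliberate choice.
    MIME_TYPES = {
      "mp3"  => "audio/mpeg",
      "wav"  => "audio/wav",
      "aac"  => "audio/aac",
      "flac" => "audio/flac",
      "pcm"  => "audio/pcm"
    }.freeze

    attr_reader :data, :model, :format

    def initialize(data:, model:, format:)
      @data = data
      @model = model
      @format = format
    end

    # Write the raw bytes in binary mode so audio data isn't mangled
    def save(path)
      File.binwrite(path, data)
      path
    end

    def to_blob
      data
    end

    def mime_type
      MIME_TYPES.fetch(format, "application/octet-stream")
    end
  end
end
```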
### Provider Examples

OpenAI (`/v1/audio/speech`):

```ruby
# lib/ruby_llm/providers/openai/speech.rb
module RubyLLM::Providers::OpenAI::Speech
  def speak(text, model:, format: "mp3", **options)
    response = connection.post("/v1/audio/speech", {
      model: model,
      input: text,
      voice: "alloy", # Single default voice for now
      response_format: format
    })
    { audio_data: response.body, format: format }
  end
end
```
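The provider returns a plain hash; the core `RubyLLM.speak` entry point would wrap it into the response object. A rough sketch of that glue, mirroring how `paint` and `transcribe` work — `SpeechResult` and `provider_speak` are stand-ins here, not RubyLLM API:

```ruby
# Stand-in for the Speech response object proposed above
SpeechResult = Struct.new(:data, :model, :format, keyword_init: true)

# Stand-in for a provider module's speak; the real one posts to an HTTP API
def provider_speak(text, model:, format: "mp3")
  { audio_data: "bytes-for-#{text}".b, format: format }
end

# Core entry point: resolve options, call the provider, wrap the raw result
def speak(text, model: "tts-1", format: "mp3")
  raw = provider_speak(text, model: model, format: format)
  SpeechResult.new(data: raw[:audio_data], model: model, format: raw[:format])
end
```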
Azure (`/cognitiveservices/v1`):

```ruby
# lib/ruby_llm/providers/azure/speech.rb
module RubyLLM::Providers::Azure::Speech
  def speak(input, model:, format: "mp3", **options)
    ssml = ssml?(input) ? input : wrap_in_ssml(input, voice: "en-US-AvaMultilingualNeural")
    response = connection.post("cognitiveservices/v1") do |req|
      req.headers["Content-Type"] = "application/ssml+xml"
      req.headers["X-Microsoft-OutputFormat"] = audio_format(format)
      req.headers["Ocp-Apim-Subscription-Key"] = config.azure_speech_api_key
      req.body = ssml
    end
    { audio_data: response.body, format: format }
  end

  private

  def ssml?(input)
    input.strip.start_with?("<speak")
  end

  def wrap_in_ssml(text, voice:)
    <<~SSML
      <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
        <voice name="#{voice}">#{text}</voice>
      </speak>
    SSML
  end
end
```
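The Azure snippet calls an `audio_format` helper that isn't shown. A possible sketch, translating the gem's generic format names into Azure `X-Microsoft-OutputFormat` identifiers — the exact strings and bitrates below are illustrative defaults, and the supported set would need checking against Azure's output-format list:

```ruby
# Assumed mapping from generic format names to Azure output identifiers;
# values chosen for illustration, not prescriptive defaults.
AZURE_OUTPUT_FORMATS = {
  "mp3" => "audio-24khz-96kbitrate-mono-mp3",
  "wav" => "riff-24khz-16bit-mono-pcm",
  "pcm" => "raw-24khz-16bit-mono-pcm"
}.freeze

def audio_format(format)
  AZURE_OUTPUT_FORMATS.fetch(format) do
    # Fail loudly rather than sending an unsupported header value
    raise ArgumentError, "unsupported Azure output format: #{format}"
  end
end
```

Failing fast on an unmapped format surfaces provider capability gaps (e.g. a format OpenAI supports but Azure doesn't) at the call site instead of as an opaque HTTP error.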
### Configuration

```ruby
RubyLLM.configure do |config|
  config.default_speech_model = "tts-1" # New config attribute
end
```
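How the new attribute could plumb through to `speak` — a minimal stand-in `Configuration`, where only `default_speech_model` itself comes from the proposal and the rest is assumed for illustration:

```ruby
# Minimal stand-in for RubyLLM's Configuration, illustration only
class Configuration
  attr_accessor :default_speech_model

  def initialize
    @default_speech_model = "tts-1" # configured default
  end
end

CONFIG = Configuration.new

# A per-call model: argument wins over the configured default
def resolve_speech_model(explicit = nil)
  explicit || CONFIG.default_speech_model
end
```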
## Files to Add/Modify

| File | Change |
| --- | --- |
| `lib/ruby_llm.rb` | Add `self.speak` method |
| `lib/ruby_llm/speech.rb` | New `Speech` class |
| `lib/ruby_llm/configuration.rb` | Add `default_speech_model` |
| `lib/ruby_llm/providers/openai.rb` | Include `Speech` module |
| `lib/ruby_llm/providers/openai/speech.rb` | New provider implementation |
| `lib/ruby_llm/providers/openai/capabilities.rb` | Add `speech: true` |
| `lib/ruby_llm/providers/azure/speech.rb` | New provider implementation |
| Model registry | Register TTS models (`tts-1`, `tts-1-hd`) |
## Why this belongs in RubyLLM
TTS isn't a simple API wrapper you'd write in application code. It requires:

- Model resolution across providers — the same model name might map to different endpoints on OpenAI vs Azure vs Google. RubyLLM's `Models.resolve` already handles this.
- Provider abstraction — each TTS provider has different auth mechanisms, endpoints, request/response formats, and audio output options. App code shouldn't know these details.
- Configuration management — API keys, defaults, per-call overrides. RubyLLM's `Configuration` and `Context` system already solves this.
- Binary response handling — TTS returns audio bytes, not JSON. This needs different connection/parsing logic that belongs in the provider layer.
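The binary-handling point can be made concrete: audio bytes must round-trip through binary-mode I/O, never text-mode writes or JSON parsing. The byte values below are arbitrary stand-ins for audio data:

```ruby
require "tempfile"

# Arbitrary bytes standing in for an MP3 payload; real audio
# would come from the provider's response body.
fake_audio = [0xFF, 0xF3, 0x44, 0x00].pack("C*")

Tempfile.create(["speech", ".mp3"]) do |f|
  File.binwrite(f.path, fake_audio) # binary mode: no encoding/newline munging
  raise "bytes corrupted" unless File.binread(f.path) == fake_audio
end
```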
Most importantly: `transcribe` (audio → text) is already in RubyLLM. `speak` (text → audio) is its natural counterpart. Leaving it out means developers use RubyLLM for 90% of their LLM needs but have to roll their own for this one capability — exactly the fragmentation RubyLLM was built to eliminate.
## Non-Goals (for initial PR)
- Multiple voices / voice selection — each provider uses a single sensible default voice for now
- SSML support — separate issue
- Streaming audio — can layer on later
- Other providers beyond OpenAI + Azure — can be added incrementally
## Related