[Feature]: optional TTS replies for voice/audio prompts #63

@shanekunz

Problem

The bot already supports voice/audio input via STT, which is great for mobile use.
A natural follow-up for that workflow is optional TTS output: when a user has toggled TTS on with /tts and sends a Telegram voice/audio message, the bot could return the normal text reply plus an audio rendering of that same final assistant response.
This would improve hands-free/mobile usability without changing the normal text-first workflow (idea from OpenClaw).

Proposal

  • Keep normal text prompts exactly as they are today: text reply only
  • For Telegram voice / audio input:
    • transcribe with the existing STT flow
    • send the normal final text response
    • optionally send a TTS audio file of that exact final assistant text
  • Make it opt-in: instead of env config, add a /tts toggle, disabled by default
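
To make the proposed behavior concrete, here is a minimal sketch of the reply decision. It is illustrative only: `plan_replies`, `AUDIO_ORIGINS`, and the `("text" | "audio", payload)` tuples are hypothetical names, not taken from the bot's codebase, and Python is used purely for illustration.

```python
from typing import List, Tuple

# Hypothetical labels for how a prompt arrived; the real bot may model this differently.
AUDIO_ORIGINS = {"voice", "audio"}


def plan_replies(origin: str, tts_on: bool, final_text: str) -> List[Tuple[str, str]]:
    """Return the replies to send, in order, for one final assistant response."""
    # The normal text reply is always sent first, whatever the origin.
    replies = [("text", final_text)]
    # An audio rendering of the same text is added only for voice/audio
    # prompts, and only when the user has toggled TTS on.
    if origin in AUDIO_ORIGINS and tts_on:
        replies.append(("audio", final_text))
    return replies
```

Text-origin prompts fall through to the text-only path even with the toggle on, which is the "no change for text prompts" guarantee.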

Proposed config

Something like:

  • TTS_ENABLED=false (superseded: use the /tts toggle instead)
  • TTS_API_URL= (fallback to STT_API_URL if unset)
  • TTS_API_KEY= (fallback to STT_API_KEY if unset)
  • TTS_MODEL=gpt-4o-mini-tts
  • TTS_VOICE=alloy
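
The fallback rules above could be resolved roughly as follows. This is a sketch under assumptions: `TtsConfig` and `load_tts_config` are hypothetical names, and treating an empty value the same as an unset one is my assumption, not something the proposal specifies.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TtsConfig:
    api_url: Optional[str]
    api_key: Optional[str]
    model: str
    voice: str


def load_tts_config(env: Mapping[str, str]) -> TtsConfig:
    """Resolve TTS settings, falling back to the STT values when unset."""

    def pick(primary: str, fallback: str) -> Optional[str]:
        # Assumption: an empty string counts as "unset".
        return env.get(primary) or env.get(fallback) or None

    return TtsConfig(
        api_url=pick("TTS_API_URL", "STT_API_URL"),
        api_key=pick("TTS_API_KEY", "STT_API_KEY"),
        model=env.get("TTS_MODEL") or "gpt-4o-mini-tts",
        voice=env.get("TTS_VOICE") or "alloy",
    )
```

In practice this would be called with `os.environ`; passing a plain mapping keeps it easy to test.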

Scope / guardrails

To keep this small and low-risk:

  • no change for text-origin prompts
  • no streaming spoken output
  • just one final audio file after the normal text reply

Why this seems aligned

This stays within the current single-chat / predictable interaction model in CONCEPT.md:

  • it does not add parallelism or group-specific behavior
  • it only extends the existing voice-input path
  • it remains optional and disabled by default

Implementation notes

I already prototyped this locally in my fork, and it was pretty contained:

  • small TTS client modeled after the existing STT client
  • lightweight tracking so only audio-origin prompts trigger TTS
  • hook into the final assistant completion path after the normal text reply
  • docs + tests included
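
For discussion, the "lightweight tracking" piece might look something like this. It is not the code from the fork: `TtsTracker` and its methods are hypothetical, and it assumes one pending prompt per chat, in line with the single-chat model.

```python
from typing import Dict, Set


class TtsTracker:
    """Hypothetical per-chat state: the /tts toggle plus the origin of the pending prompt."""

    def __init__(self) -> None:
        self._enabled: Set[int] = set()            # chats with /tts on (off by default)
        self._pending_audio: Dict[int, bool] = {}  # chat_id -> was the last prompt audio?

    def toggle(self, chat_id: int) -> bool:
        """Flip the /tts setting for a chat and return the new state."""
        if chat_id in self._enabled:
            self._enabled.discard(chat_id)
            return False
        self._enabled.add(chat_id)
        return True

    def record_prompt(self, chat_id: int, is_audio: bool) -> None:
        """Remember whether the prompt being processed came from voice/audio."""
        self._pending_audio[chat_id] = is_audio

    def should_speak(self, chat_id: int) -> bool:
        """Call once after the final text reply; consumes the pending mark."""
        was_audio = self._pending_audio.pop(chat_id, False)
        return was_audio and chat_id in self._enabled
```

Consuming the mark in `should_speak` means a single prompt can produce at most one audio file, matching the "just one final audio file" guardrail.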

Done criteria (optional)

  • When sending a voice memo/file with TTS enabled in .env and toggled on via /tts, the bot responds with the text output and then an audio file of that text.
  • When sending a text input, the bot always replies with text as before, regardless of configuration.

Labels: enhancement (New feature or request)
