feat: add Xiaomi MiMo speech support #2560

Open
xyuai wants to merge 2 commits into Hmbown:main from xyuai:feat/xiaomi-mimo-speech

Conversation

@xyuai
Contributor

@xyuai xyuai commented Jun 2, 2026

Summary

  • Add Xiaomi MiMo speech support and configuration wiring.
  • Register the speech tool through TUI/agent setup paths.
  • Update provider docs and example configuration.

Validation

  • cargo fmt --check
  • cargo check -p codewhale-config -p codewhale-agent -p codewhale-cli -p codewhale-tui

Greptile Summary

This PR adds end-to-end Xiaomi MiMo TTS support: a synthesize_speech client method, a SpeechTool / tts alias tool, a codewhale speech CLI command, and all the config/model-registry wiring needed to make them work in both the interactive TUI and subagent contexts. The three issues raised in the previous review round (empty user message on no instruction, chat-only models in the model-visible list, and speech_output_dir not forwarded to subagents) have all been addressed and are backed by new tests.

  • crates/tui/src/client.rs – adds SpeechSynthesisRequest/Response and synthesize_speech, which POSTs to chat/completions with the spoken text in an assistant message; the optional style instruction is conditionally included as a user message only when non-empty.
  • crates/tui/src/tools/speech.rs – new model-visible SpeechTool with model inference (tts / voice-design / voice-clone), voice-clone data-URI encoding, network-policy checks, and workspace-bounded output-path resolution.
  • Config plumbing (SpeechConfig, [speech].output_dir, env-var overrides, EngineConfig::speech_output_dir, SubAgentRuntime::speech_output_dir) ensures the configured output directory is consistently inherited across the main engine and all subagent spawn paths.
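The conditional user message described for synthesize_speech can be sketched as follows. This is a minimal, std-only illustration and not the PR's actual build_speech_synthesis_body: the `ChatMessage` struct, the `build_speech_messages` name, and the ordering of the two messages are assumptions made for the example.

```rust
/// One chat message in the request body (hypothetical shape for illustration).
#[derive(Debug, Clone, PartialEq)]
pub struct ChatMessage {
    pub role: &'static str,
    pub content: String,
}

/// Build the `messages` list for a speech request.
///
/// The spoken text always goes into an assistant message. The style
/// instruction becomes a user message only when it has non-whitespace
/// content, so a missing instruction never produces `"content": ""`.
pub fn build_speech_messages(text: &str, instruction: Option<&str>) -> Vec<ChatMessage> {
    let mut messages = Vec::new();

    // Assumption: the user message, when present, precedes the assistant message.
    if let Some(instruction) = instruction.map(str::trim).filter(|s| !s.is_empty()) {
        messages.push(ChatMessage {
            role: "user",
            content: instruction.to_string(),
        });
    }

    messages.push(ChatMessage {
        role: "assistant",
        content: text.to_string(),
    });
    messages
}
```

Calling `build_speech_messages("Hello", None)` yields a single assistant message, which is the case the documented no-instruction invocation depends on.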

Confidence Score: 5/5

Safe to merge; the three previously blocking issues are all fixed and covered by tests.

All three issues flagged in the prior review are resolved: the empty user-message bug is fixed in build_speech_synthesis_body, SUPPORTED_XIAOMI_MIMO_SPEECH_MODELS now contains only TTS model IDs (enforced by a new assertion test), and speech_output_dir is threaded through SubAgentRuntime and EngineConfig into every subagent speech-tool registration site. The remaining findings are minor validation and formatting gaps that don't affect correctness in normal usage.

The voice-resolution block in crates/tui/src/tools/speech.rs (and its mirror in crates/tui/src/main.rs) would benefit from a tighter guard when the resolved model is voiceclone but the supplied voice is a plain built-in ID rather than a data URI.
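One possible shape for the tighter guard is sketched below. It is an illustration only: `validate_voice`, `VoiceError`, the substring check on the model ID, and the `data:` prefix test are all assumptions, and the real voice-resolution code may be organised differently.

```rust
/// Error returned when a voice value does not fit the resolved model
/// (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum VoiceError {
    /// A voice-clone model was given something other than a data URI.
    CloneNeedsDataUri { voice: String },
}

/// Reject a plain built-in voice ID when the resolved model is a
/// voice-clone model, instead of letting it pass through silently.
///
/// Assumption: voice-clone models are recognisable by "voiceclone" in the
/// model ID, and encoded reference audio always starts with "data:".
/// A missing voice is left to other validation and is not rejected here.
pub fn validate_voice(model: &str, voice: Option<&str>) -> Result<(), VoiceError> {
    let is_clone_model = model.contains("voiceclone");
    match voice {
        Some(v) if is_clone_model && !v.starts_with("data:") => {
            Err(VoiceError::CloneNeedsDataUri { voice: v.to_string() })
        }
        _ => Ok(()),
    }
}
```

Because the same check is said to be needed in both crates/tui/src/tools/speech.rs and crates/tui/src/main.rs, a single function like this would ideally be called from both places.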

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/tui/src/tools/speech.rs | New speech tool file implementing SpeechTool (model-visible) with model inference, voice clone encoding, network policy checks, and output path resolution. SUPPORTED_XIAOMI_MIMO_SPEECH_MODELS now correctly lists only TTS models. Validation gap: voiceclone model + non-data-URI voice silently passes through instead of giving a clear error. |
| crates/tui/src/client.rs | Adds SpeechSynthesisRequest/Response structs and synthesize_speech method. Previously flagged empty-user-message bug is fixed via filter on instruction. parse_speech_audio_response handles both message.audio and top-level audio shapes. |
| crates/tui/src/main.rs | Adds SpeechArgs and run_speech function for the CLI speech/tts command. Same voiceclone-with-non-data-URI validation gap as in speech.rs. Config-based output_dir fallback chain is correct. |
| crates/tui/src/tools/registry.rs | Adds with_speech_tools builder method registering both 'speech' and 'tts' aliases. speech_output_dir is now correctly threaded through from SubAgentRuntime into with_full_agent_surface, resolving the previously flagged forwarding gap. |
| crates/tui/src/tools/subagent/mod.rs | Adds speech_output_dir field to SubAgentRuntime and with_speech_output_dir builder. child_runtime() propagates the value. Test confirms inheritance. |
| crates/tui/src/config.rs | Adds SpeechConfig struct, speech field in Config, speech_output_dir() resolver (env vars + toml), and canonical_xiaomi_mimo_model_id for TTS alias normalization. TTS model IDs added to model completion list. |
| crates/config/src/lib.rs | Adds TTS model constants and canonical_xiaomi_mimo_model_id. normalize_model_for_provider correctly applies TTS alias expansion before other normalization paths. |
| crates/agent/src/lib.rs | Registers four MiMo TTS ModelInfo entries with supports_tools=false, supports_reasoning=false. Aliases are consistent with config normalization. |
| config.example.toml | Documents new TTS model IDs and [speech] config section. Three new TTS model comment lines use '?' as the separator instead of the em dash used by all adjacent entries. |
| crates/tui/src/core/engine.rs | Adds speech_output_dir field to EngineConfig and threads it into two SubAgentRuntime construction sites. |
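The TTS alias normalization mentioned for crates/config/src/lib.rs and crates/tui/src/config.rs might look roughly like the sketch below. Only the `tts` to `mimo-v2.5-tts` mapping appears in this PR page; the other two model IDs, the alias spellings, the three-entry list, and the `Option` return type are placeholders assumed for illustration.

```rust
/// Illustrative list of TTS-capable model IDs.
/// Only "mimo-v2.5-tts" is taken from the PR page; the other two IDs are
/// hypothetical placeholders, and the real registry has four entries.
pub const SUPPORTED_SPEECH_MODELS: &[&str] = &[
    "mimo-v2.5-tts",
    "mimo-v2.5-tts-voicedesign", // hypothetical ID
    "mimo-v2.5-tts-voiceclone",  // hypothetical ID
];

/// Map a user-facing alias or full ID to its canonical model ID.
/// Returns `None` for anything that is not a known speech model, so
/// chat-only models are never advertised as speech models.
pub fn canonical_speech_model_id(input: &str) -> Option<&'static str> {
    let key = input.trim().to_ascii_lowercase();
    match key.as_str() {
        "tts" | "mimo-v2.5-tts" => Some("mimo-v2.5-tts"),
        // Assumed alias spellings, following the tts / voice-design /
        // voice-clone modes named in the review summary.
        "voice-design" | "mimo-v2.5-tts-voicedesign" => Some("mimo-v2.5-tts-voicedesign"),
        "voice-clone" | "mimo-v2.5-tts-voiceclone" => Some("mimo-v2.5-tts-voiceclone"),
        _ => None,
    }
}
```

Keeping one such function in a single crate and re-exporting it avoids the two copies drifting apart.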

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as codewhale speech CLI
    participant Tool as SpeechTool (TUI)
    participant Client as DeepSeekClient
    participant API as Xiaomi MiMo API

    User->>CLI: codewhale speech "Hello" --model tts -o out.wav
    CLI->>CLI: infer_speech_model → mimo-v2.5-tts
    CLI->>CLI: "validate provider == xiaomi-mimo"
    CLI->>CLI: resolve output path
    CLI->>Client: synthesize_speech(model, text, instruction?, voice?)
    Client->>Client: wire_model_for_provider → canonical model ID
    Client->>Client: build_speech_synthesis_body
    Client->>API: POST /v1/chat/completions
    API-->>Client: JSON with audio.data base64
    Client->>Client: parse_speech_audio_response → decode base64
    Client-->>CLI: SpeechSynthesisResponse
    CLI->>CLI: fs::write(output_path, audio_bytes)
    CLI-->>User: Generated speech: out.wav (N bytes)

    Note over Tool,API: Agent/YOLO path: SpeechTool follows same flow
    Note over Tool,API: with network-policy check and workspace-bounded path resolution
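
The workspace-bounded path resolution noted in the diagram can be approximated with a purely lexical check, sketched below. This is an assumption-laden illustration: `resolve_output_path` is a hypothetical name, rejecting absolute paths outright is a simplification, and a production version would likely also need to account for symlinks.

```rust
use std::path::{Component, Path, PathBuf};

/// Join `requested` onto `workspace`, refusing any path that would
/// escape the workspace root.
///
/// The check is lexical only: it walks the path components and tracks
/// how deep below the workspace the result is. Symlinks are not resolved.
pub fn resolve_output_path(workspace: &Path, requested: &Path) -> Option<PathBuf> {
    let mut resolved = workspace.to_path_buf();
    let mut depth = 0usize;

    for component in requested.components() {
        match component {
            Component::Normal(part) => {
                resolved.push(part);
                depth += 1;
            }
            Component::CurDir => {}
            Component::ParentDir => {
                // Stepping above the workspace root is not allowed.
                if depth == 0 {
                    return None;
                }
                resolved.pop();
                depth -= 1;
            }
            // Simplifying assumption: absolute paths are rejected.
            Component::RootDir | Component::Prefix(_) => return None,
        }
    }

    // An empty or fully cancelled path names no file inside the workspace.
    if depth == 0 {
        return None;
    }
    Some(resolved)
}
```

With a workspace of `/ws`, a request for `audio/out.wav` resolves inside the workspace, while `../out.wav` is refused.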

Comments Outside Diff (1)

  1. crates/tui/src/tools/speech.rs, lines 1595-1617

    P2 Significant helper duplication between this file and crates/tui/src/main.rs. combine_speech_instructions, normalize_speech_format, default_speech_output_name, encode_voice_clone_data_uri, and describe_speech_voice are copied verbatim (or near-verbatim) into both files. Additionally, canonical_xiaomi_mimo_model_id is duplicated between crates/config/src/lib.rs and crates/tui/src/config.rs. Any future fix to one copy will likely miss the other. These should be extracted to a shared module or, for the config normalizer, re-exported from the single canonical location.
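
    A rough sketch of the suggested extraction is shown here. The module name `speech_common`, the two wrapper functions, and the helper bodies (including the "wav" default and the "speech.<ext>" file name) are assumptions for illustration; only the helper names come from the review comment.

```rust
/// Hypothetical shared module holding the helpers that were duplicated.
pub mod speech_common {
    /// Lower-case the requested audio format and strip a leading dot.
    /// Assumption: "wav" is the fallback when no format is given.
    pub fn normalize_speech_format(format: Option<&str>) -> String {
        match format.map(|f| f.trim().trim_start_matches('.').to_ascii_lowercase()) {
            Some(f) if !f.is_empty() => f,
            _ => "wav".to_string(),
        }
    }

    /// Assumed default file name pattern for generated audio.
    pub fn default_speech_output_name(format: &str) -> String {
        format!("speech.{format}")
    }
}

/// Stand-in for the CLI call site (crates/tui/src/main.rs in the PR).
pub fn cli_output_name(format: Option<&str>) -> String {
    let format = speech_common::normalize_speech_format(format);
    speech_common::default_speech_output_name(&format)
}

/// Stand-in for the tool call site (crates/tui/src/tools/speech.rs in the PR).
pub fn tool_output_name(format: Option<&str>) -> String {
    let format = speech_common::normalize_speech_format(format);
    speech_common::default_speech_output_name(&format)
}
```

    Both call sites now go through the same two functions, so a fix to format handling lands in one place.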

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!


Reviews (2): Last reviewed commit: "fix: harden Xiaomi MiMo speech flow"

Comment thread crates/tui/src/client.rs Outdated
Comment thread crates/tui/src/tools/speech.rs
Comment thread crates/tui/src/tools/registry.rs Outdated
@Hmbown
Owner

Hmbown commented Jun 2, 2026

Thanks for adding the Xiaomi MiMo speech path. This is promising, but I would not harvest it into v0.8.50 yet because the current branch still has a few runtime correctness issues that would affect the default user flow.

Concrete next steps:

  1. In DeepSeekClient::synthesize_speech, only include the user message when the instruction is non-empty. The documented codewhale speech "text" --model tts path should not send "content": "".
  2. Keep SUPPORTED_XIAOMI_MIMO_SPEECH_MODELS to TTS-capable models only, so the tool does not advertise chat-only models that the TTS guard will reject.
  3. Thread the configured [speech].output_dir / XIAOMI_MIMO_SPEECH_OUTPUT_DIR through the subagent tool registration path too; right now parent and subagent invocations can disagree.
  4. Deduplicate the speech helper functions between crates/tui/src/main.rs and crates/tui/src/tools/speech.rs, and keep Xiaomi MiMo model normalization canonical in one module.
  5. Add focused tests for the no-instruction request body, supported-model list, configured output dir in the tool path, and one CLI passthrough smoke.

Once those are fixed, this looks like a good provider feature to revisit. I am keeping it out of the release harvest for now because provider features need to be boringly correct at the first documented invocation.
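
The output-directory precedence behind step 3 can be sketched as a small pure function. This is a hedged illustration: the function names are hypothetical, and treating a blank environment value as unset is an assumption. The environment variable name and the idea that it overrides [speech].output_dir are taken from this PR page.

```rust
use std::path::PathBuf;

/// Pick the speech output directory: the environment value wins over the
/// config-file value, and blank values are treated as unset (assumption).
pub fn resolve_speech_output_dir(
    env_value: Option<&str>,
    config_value: Option<&str>,
) -> Option<PathBuf> {
    env_value
        .into_iter()
        .chain(config_value)
        .map(str::trim)
        .find(|v| !v.is_empty())
        .map(PathBuf::from)
}

/// Convenience wrapper that reads the real process environment.
/// `config_value` stands in for the parsed `[speech].output_dir` setting.
pub fn speech_output_dir_from_env(config_value: Option<&str>) -> Option<PathBuf> {
    let env_value = std::env::var("XIAOMI_MIMO_SPEECH_OUTPUT_DIR").ok();
    resolve_speech_output_dir(env_value.as_deref(), config_value)
}
```

If the parent engine and every subagent registration site call the same resolver (or receive its result), the two can no longer disagree about where audio is written.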

@xyuai
Contributor Author

xyuai commented Jun 2, 2026

Thanks for the detailed feedback. I pushed an update in 2c34fee2 that addresses the five items:

  1. Omit the user message when instruction is empty.
  2. Restrict supported Xiaomi MiMo speech models to TTS-capable models.
  3. Thread speech_output_dir through the subagent tool registration/runtime path.
  4. Deduplicate speech helpers between the CLI and tool code, keeping MiMo speech model normalization in one module.
  5. Add focused tests for the no-instruction request body, supported-model list, configured output dir path, subagent inheritance, and CLI passthrough smoke.

Validation:

  • cargo fmt --check
  • cargo check -p codewhale-config -p codewhale-agent -p codewhale-cli -p codewhale-tui
  • focused cargo test -p codewhale-tui --bin codewhale-tui ... speech/subagent tests
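
The subagent inheritance covered by items 3 and 5 can be pictured with the stand-in below. It reuses the names SubAgentRuntime, with_speech_output_dir, and child_runtime from the review summary, but the struct here is a one-field mock and the signatures are assumptions, not the PR's real definitions.

```rust
use std::path::{Path, PathBuf};

/// Minimal stand-in for the runtime handed to subagents.
/// The real struct carries much more state; only the speech field is shown.
#[derive(Debug, Clone, Default)]
pub struct SubAgentRuntime {
    speech_output_dir: Option<PathBuf>,
}

impl SubAgentRuntime {
    /// Builder-style setter (assumed signature).
    pub fn with_speech_output_dir(mut self, dir: Option<PathBuf>) -> Self {
        self.speech_output_dir = dir;
        self
    }

    /// Read access for tool registration code.
    pub fn speech_output_dir(&self) -> Option<&Path> {
        self.speech_output_dir.as_deref()
    }

    /// Create the runtime for a spawned child, copying the configured
    /// directory so parent and child write audio to the same place.
    pub fn child_runtime(&self) -> Self {
        Self {
            speech_output_dir: self.speech_output_dir.clone(),
        }
    }
}
```

An inheritance test then only needs to build a parent with a directory, call `child_runtime()`, and compare the two values.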

@xyuai
Contributor Author

xyuai commented Jun 2, 2026

@Hmbown I pushed the requested fixes in 2c34fee2 and added focused tests. The GitHub Actions check appears to need maintainer approval to run when you have a moment.

@Hmbown
Owner

Hmbown commented Jun 2, 2026

Hey @xyuai — the Xiaomi MiMo speech support has been harvested into v0.8.50 (#2504)! The fix commit addressing the review feedback was solid — all 17 speech tests pass and the code is clean. Love seeing the full stack: model registry, CLI, tool registration, config, and tests all wired together. Really appreciate you pushing the fixes through. Thank you! 🐋🎤
