feat: add Xiaomi MiMo speech support #2560
Thanks for adding the Xiaomi MiMo speech path. This is promising, but I would not harvest it into v0.8.50 yet because the current branch still has a few runtime correctness issues that would affect the default user flow. Concrete next steps:
Once those are fixed, this looks like a good provider feature to revisit. I am keeping it out of the release harvest for now because provider features need to be boringly correct at the first documented invocation.

Thanks for the detailed feedback. I pushed an update in
Validation:

@Hmbown I pushed the requested fixes in

Hey @xyuai, the Xiaomi MiMo speech support has been harvested into v0.8.50 (#2504)! The fix commit addressing the review feedback was solid: all 17 speech tests pass and the code is clean. Love seeing the full stack: model registry, CLI, tool registration, config, and tests all wired together. Really appreciate you pushing the fixes through. Thank you! 🐋🎤

Summary
Validation
Greptile Summary
This PR adds end-to-end Xiaomi MiMo TTS support: a `synthesize_speech` client method, a `SpeechTool`/`tts` alias tool, a `codewhale speech` CLI command, and all the config/model-registry wiring needed to make them work in both the interactive TUI and subagent contexts. The three issues raised in the previous review round (empty user message on no instruction, chat-only models in the model-visible list, and `speech_output_dir` not forwarded to subagents) have all been addressed and are backed by new tests.

- `crates/tui/src/client.rs`: adds `SpeechSynthesisRequest`/`Response` and `synthesize_speech`, which POSTs to `chat/completions` with the spoken text in an `assistant` message; the optional style instruction is conditionally included as a `user` message only when non-empty.
- `crates/tui/src/tools/speech.rs`: new model-visible `SpeechTool` with model inference (`tts`/`voice-design`/`voice-clone`), voice-clone data-URI encoding, network-policy checks, and workspace-bounded output-path resolution.
- `SpeechConfig`, `[speech].output_dir`, env-var overrides, `EngineConfig::speech_output_dir`, and `SubAgentRuntime::speech_output_dir` ensure the configured output directory is consistently inherited across the main engine and all subagent spawn paths.

Confidence Score: 5/5
Safe to merge; the three previously blocking issues are all fixed and covered by tests.
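
For readers following the thread, the first of those fixes (no empty `user` message when no instruction is given) can be pictured with a small standard-library sketch. This is not the PR's `build_speech_synthesis_body`: the `Message` struct, the function name `build_speech_messages`, the ordering of the two messages, and the choice to treat whitespace-only instructions as empty are all assumptions made for illustration.

```rust
/// One chat message in the request (hypothetical shape, for illustration only).
#[derive(Debug, PartialEq)]
struct Message {
    role: &'static str,
    content: String,
}

/// Build the message list for a speech request.
/// The spoken text always goes in an `assistant` message. A `user` message
/// is added only when the style instruction is non-empty after trimming
/// (trimming is an assumption, not something the thread confirms).
fn build_speech_messages(text: &str, instruction: Option<&str>) -> Vec<Message> {
    let mut messages = Vec::new();
    // Skip `None`, "" and whitespace-only instructions alike.
    if let Some(instr) = instruction.map(str::trim).filter(|s| !s.is_empty()) {
        messages.push(Message {
            role: "user",
            content: instr.to_string(),
        });
    }
    messages.push(Message {
        role: "assistant",
        content: text.to_string(),
    });
    messages
}
```

With this shape, `build_speech_messages("Hello", None)` and `build_speech_messages("Hello", Some("  "))` both yield a single `assistant` message, while a real instruction adds one `user` message.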
All three issues flagged in the prior review are resolved: the empty user-message bug is fixed in `build_speech_synthesis_body`, `SUPPORTED_XIAOMI_MIMO_SPEECH_MODELS` now contains only TTS model IDs (enforced by a new assertion test), and `speech_output_dir` is threaded through `SubAgentRuntime` and `EngineConfig` into every subagent speech-tool registration site. The remaining findings are minor validation and formatting gaps that don't affect correctness in normal usage.

The voice-resolution block in `crates/tui/src/tools/speech.rs` (and its mirror in `crates/tui/src/main.rs`) would benefit from a tighter guard when the resolved model is `voiceclone` but the supplied voice is a plain built-in ID rather than a data URI.

Important Files Changed
Sequence Diagram
    sequenceDiagram
        participant User
        participant CLI as codewhale speech CLI
        participant Tool as SpeechTool (TUI)
        participant Client as DeepSeekClient
        participant API as Xiaomi MiMo API
        User->>CLI: codewhale speech "Hello" --model tts -o out.wav
        CLI->>CLI: infer_speech_model → mimo-v2.5-tts
        CLI->>CLI: "validate provider == xiaomi-mimo"
        CLI->>CLI: resolve output path
        CLI->>Client: synthesize_speech(model, text, instruction?, voice?)
        Client->>Client: wire_model_for_provider → canonical model ID
        Client->>Client: build_speech_synthesis_body
        Client->>API: POST /v1/chat/completions
        API-->>Client: JSON with audio.data base64
        Client->>Client: parse_speech_audio_response → decode base64
        Client-->>CLI: SpeechSynthesisResponse
        CLI->>CLI: fs::write(output_path, audio_bytes)
        CLI-->>User: Generated speech: out.wav (N bytes)
        Note over Tool,API: Agent/YOLO path: SpeechTool follows same flow
        Note over Tool,API: with network-policy check and workspace-bounded path resolution

Comments Outside Diff (1)
`crates/tui/src/tools/speech.rs`, lines 1595-1617, and `crates/tui/src/main.rs`: `combine_speech_instructions`, `normalize_speech_format`, `default_speech_output_name`, `encode_voice_clone_data_uri`, and `describe_speech_voice` are copied verbatim (or near-verbatim) into both files. Additionally, `canonical_xiaomi_mimo_model_id` is duplicated between `crates/config/src/lib.rs` and `crates/tui/src/config.rs`. Any future fix to one copy will likely miss the other. These should be extracted to a shared module or, for the config normalizer, re-exported from the single canonical location.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
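
The deduplication suggested in this comment could be sketched as follows. This is only an illustration of the pattern, not the PR's code: the body of `normalize_speech_format` is a stand-in (the real normalization rules are not shown in this thread), and `cli_output_name` and `tool_output_name` are hypothetical call sites invented here.

```rust
// Shared helper, defined once. In a real crate it would live in a single
// module that both callers import, rather than being copied into each file.
// Stand-in body: trims whitespace, drops leading dots, lowercases.
fn normalize_speech_format(raw: &str) -> String {
    raw.trim().trim_start_matches('.').to_ascii_lowercase()
}

// Hypothetical CLI call site (the role `main.rs` plays in the PR).
fn cli_output_name(stem: &str, format_flag: &str) -> String {
    format!("{}.{}", stem, normalize_speech_format(format_flag))
}

// Hypothetical tool call site (the role `tools/speech.rs` plays in the PR).
fn tool_output_name(stem: &str, format_arg: &str) -> String {
    format!("{}.{}", stem, normalize_speech_format(format_arg))
}
```

Because both call sites go through the same function, a later change to the normalization rules reaches the CLI and the tool together.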
Reviews (2): Last reviewed commit: "fix: harden Xiaomi MiMo speech flow"