feat(stt): add NVIDIA Canary STT engine support#360
feat(stt): add NVIDIA Canary STT engine support#360coleleavitt wants to merge 3 commits intomkiol:mainfrom
Conversation
Add support for NVIDIA's Canary speech-to-text models via NeMo toolkit: - Canary 1B v2: 4.89% WER, 630x RTF (5x faster than Whisper) - Canary Qwen 2.5B: Higher accuracy variant for demanding use cases Both models use NeMo's EncDecMultiTaskModel architecture with automatic model download via HuggingFace. Supports GPU acceleration (CUDA/ROCm), translation (s2t_translation), and punctuation restoration. New files: - src/canary_engine.hpp: Engine class definition - src/canary_engine.cpp: NeMo Python integration via py_executor Modified: - models_manager.h/cpp: Add stt_canary engine type and feature flags - speech_service.cpp: Engine instantiation and type checking - CMakeLists.txt: Add canary_engine source files - config/models.json: Add both Canary model entries Requires: pip install nemo_toolkit[asr]
Check for nemo.collections.asr module availability at startup. This enables dsnote to automatically detect if NeMo is installed and show/hide Canary models accordingly in the UI. - py_tools.hpp: Add nemo_asr to libs_availability_t - py_tools.cpp: Add nemo.collections.asr import check - speech_service.cpp: Map nemo_asr availability to stt_canary
- Update CMakeLists.txt to use Qt6 instead of Qt5 - Update cmake/*.cmake files for Qt6 compatibility - Replace deprecated Qt5 APIs with Qt6 equivalents: - QRegExp -> QRegularExpression - QX11Info -> QNativeInterface::QX11Application - QMediaPlayer::State -> QMediaPlayer::PlaybackState - QMediaPlayer::stateChanged -> playbackStateChanged - setMedia(QMediaContent) -> setSource(QUrl) - QAudioInput (recording) -> QAudioSource - QAudioDeviceInfo -> QAudioDevice + QMediaDevices - QAudioFormat::setSampleSize/setCodec -> setSampleFormat - QNetworkRequest::FollowRedirectsAttribute -> RedirectPolicyAttribute - Remove Qt::AA_EnableHighDpiScaling (default in Qt6) - Remove QTextCodec usage - Remove QQuickStyle::availableStyles() (not in Qt6) - Fix GCC 15 type strictness (std::clamp/max int vs qsizetype) - Update qhotkey external project to build with Qt6
|
Hi. Sorry for late reply. I'm a bit busy at the moment and need a few more days to look at the code and test it. Thank you for your understanding. Something I can comment on immediately:
It is too radical change for now. I want you to revert it. The key problem is that I need to maintain both Qt5 and Qt6 compatible code base as the SFOS version does not run on Qt6. The work to move to the Qt6 is already started in Qt6 branch and I plan is to complete it for the next version. |
mkiol
left a comment
There was a problem hiding this comment.
It's very impressive and promising. I haven't been able to test it yet, as it's not finished. I really would like to merge it as soon as it's ready.
The most important to-dos:
- revert Qt6 changes
- fix model configuration (models.json) to make new engine usable - I can help with this
|
|
||
| configure_file(${dbus_dir}/dsnote.xml.in ${dbus_dsnote_interface_file}) | ||
|
|
||
| find_package(Qt5 COMPONENTS DBus REQUIRED) |
There was a problem hiding this comment.
Could you limit the scope of this PR to the new STT engine only? Migrating from Qt5 to Qt6 is a completely different task that requires a separate PR. If you agree, please revert everything related to Qt6. Thank you!
| { | ||
| "name": "Multilingual (Canary 1B v2)", | ||
| "model_id": "multilang_canary_1b_v2", | ||
| "engine": "stt_canary", | ||
| "lang_id": "multilang", | ||
| "info": "NVIDIA Canary 1B v2 - 4.89% WER, 5x faster than Whisper (RTFx 630), best accuracy-per-watt", | ||
| "options": "ti", | ||
| "score": 5, | ||
| "features": ["high_quality", "medium_processing", "stt_punctuation"], | ||
| "default_for_lang": true, | ||
| "hidden": false | ||
| }, | ||
| { | ||
| "name": "Multilingual (Canary Qwen 2.5B)", | ||
| "model_id": "multilang_canary_qwen", | ||
| "engine": "stt_canary", | ||
| "lang_id": "multilang", | ||
| "info": "NVIDIA Canary Qwen 2.5B - Larger model for maximum accuracy", | ||
| "options": "ti", | ||
| "score": 4, | ||
| "features": ["high_quality", "slow_processing", "stt_punctuation"], | ||
| "hidden": false | ||
| }, |
There was a problem hiding this comment.
This json file contains three objects: langs, models and packs. The model definition must be included in models and not in packs. Also urls must be specified. I can help you with this. Just tell me where I can download the model files.
Summary
Add support for NVIDIA's Canary speech-to-text models via NeMo toolkit.
Models Added
multilang_canary_1b_v2multilang_canary_qwenFeatures
fasterwhisper_enginepatternsFiles Changed
src/canary_engine.hpp,src/canary_engine.cppmodels_manager.h/cpp,speech_service.cpp,CMakeLists.txt,config/models.jsonRequirements
Why Canary?
Per the Open ASR Leaderboard:
Testing