Add HTS label support infrastructure #2132
Open
rokujyushi wants to merge 9 commits into
Pull request overview
This PR introduces new HTS label generation infrastructure and refactors existing ENUNU/nnmnkwii utilities into OpenUtau.Core to support richer HTS note/phoneme context features and better data flow through the render pipeline.
Changes:
- Added new HTS core model + context builder (`OpenUtau.Core/Util/HTS.cs`) and new HTS label renderer/phonemizer bases (`OpenUtau.Core/Hts/*`).
- Refactored nnmnkwii-related utility classes into `OpenUtau.Core/Util/*` and updated ENUNU ONNX phonemizer imports/usages accordingly.
- Extended render pipeline data to track per-phoneme `noteIndex` and added tests for HTS spec/pipeline behavior.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| OpenUtau.Test/Plugins/HtsLabelPhonemizerTest.cs | Adds an end-to-end-ish HTS label + frontend feature extraction test using a dummy phonemizer. |
| OpenUtau.Test/Core/Util/HtsSpecTests.cs | Adds spec-like unit tests for HTS note/phrase context fields and alignment helpers. |
| OpenUtau.Plugin.Builtin/EnunuOnnx/HTS.cs | Removes the old monolithic HTS implementation from the ENUNU plugin. |
| OpenUtau.Plugin.Builtin/EnunuOnnx/EnunuOnnxPhonemizer.cs | Updates to new HTS structures and adds richer HTS note context + phoneme vowel-distance features. |
| OpenUtau.Core/Util/Scaler.cs | Moves Scaler into OpenUtau.Core.Util namespace for broader reuse. |
| OpenUtau.Core/Util/Python.cs | Moves nnmnkwii python exception types into OpenUtau.Core.Util.nnmnkwii.python. |
| OpenUtau.Core/Util/Merlin.cs | Moves Merlin frontend implementation into OpenUtau.Core.Util.nnmnkwii.frontend. |
| OpenUtau.Core/Util/HTSLabelFile.cs | Moves HTS label IO into OpenUtau.Core.Util.nnmnkwii.io.hts. |
| OpenUtau.Core/Util/HTS.cs | Introduces new HTS note/phoneme/phrase classes and alignment/context-building helpers. |
| OpenUtau.Core/Render/RenderPhrase.cs | Adds noteIndex propagation from notes to phones for better note↔phoneme mapping. |
| OpenUtau.Core/Hts/HTSLabelRenderer.cs | Adds a new renderer base that can emit HTS labels from RenderPhrase. |
| OpenUtau.Core/Hts/HTSLabelPhonemizer.cs | Adds a new phonemizer base that can generate HTS labels and align timings back into OpenUtau phonemes. |
Comments suppressed due to low confidence (2)
OpenUtau.Core/Util/HTS.cs:727
- Backward measure contexts likely have the same off-by-one issue: `measureIndexBackward` is set to `group.Count - noteIndex` (1-based) and `measurePercentBackward` uses `(totalNotesInMeasure - 1)` as denominator. The spec tests added in this PR expect 0-based backward indexes and percentages like 66/33/0 for 3 notes (denominator = `totalNotesInMeasure`).
for (int noteIndex = group.Count - 1; noteIndex >= 0; --noteIndex) {
var note = group[noteIndex];
int backwardIndex = group.Count - noteIndex;
note.measureIndexBackward = backwardIndex;
note.measureMsBackward = (int)Math.Round(accMsB / 100.0);
note.measureTickBackward = ticksPer96th > 0 ? (int)Math.Round((double)accTicksB / ticksPer96th) : 0;
note.measurePercentBackward = totalNotesInMeasure > 1 ? ((backwardIndex - 1) * 100) / (totalNotesInMeasure - 1) : 0;
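The expectation described in this comment (0-based backward indexes, and 66/33/0 for three notes) can be sketched as below. `MeasureContext` and its method names are hypothetical and not part of the PR; only the arithmetic follows the comment, and the sketch does not settle whether the code or the tests reflect the intended spec.

```csharp
using System;

// Hypothetical helper, not from the PR: illustrates a 0-based backward index
// and a percentage that uses the note count of the measure as denominator.
public static class MeasureContext {
    // Backward index of note `noteIndex` in a measure holding `count` notes.
    // The last note in the measure gets 0.
    public static int BackwardIndex(int count, int noteIndex) {
        return count - 1 - noteIndex;
    }

    // Backward percentage with integer division, as in the excerpt above.
    public static int BackwardPercent(int count, int noteIndex) {
        if (count <= 0) {
            return 0;
        }
        return BackwardIndex(count, noteIndex) * 100 / count;
    }
}
```

For a three-note measure, `BackwardPercent(3, 0)`, `BackwardPercent(3, 1)`, and `BackwardPercent(3, 2)` evaluate to 66, 33, and 0.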
OpenUtau.Core/Hts/HTSLabelPhonemizer.cs:74
- `g2p` is loaded before `LoadDict`, but `LoadG2p` adds entries from `phoneDict`, which is still empty at that point. Also, `LoadG2p` is called with `singer.Location` instead of the computed `rootPath`, so it may read the wrong `enunux.yaml`. Consider loading the dictionary first (as `EnunuOnnxPhonemizer` does) and passing `rootPath` into `LoadG2p`.
//Load g2p from enunux.yaml
//g2p dict should be load after enunu dict
try {
g2p = LoadG2p(singer.Location);
} catch (Exception e) {
Log.Error(e, "failed to load g2p dictionary");
return;
}
//Load Dictionary
var enunuDictPath = Path.Join(rootPath, tablePath);
try {
LoadDict(Path.Join(rootPath, tablePath), singer.TextFileEncoding);
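The ordering concern raised in this comment can be shown with a minimal sketch. `LoadOrderSketch` and the simplified `LoadDict`/`LoadG2p` signatures are stand-ins invented for illustration; the real loaders in the PR read files and may behave differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the phonemizer's loaders, not the PR's code.
public class LoadOrderSketch {
    private readonly Dictionary<string, string[]> phoneDict =
        new Dictionary<string, string[]>();

    // Stand-in for LoadDict: fills phoneDict from a table of entries.
    public void LoadDict(IEnumerable<KeyValuePair<string, string[]>> table) {
        foreach (var entry in table) {
            phoneDict[entry.Key] = entry.Value;
        }
    }

    // Stand-in for LoadG2p: copies whatever phoneDict holds at call time,
    // so calling it before LoadDict yields an empty result.
    public Dictionary<string, string[]> LoadG2p() {
        return new Dictionary<string, string[]>(phoneDict);
    }
}
```

In this sketch, calling `LoadG2p()` before `LoadDict(...)` returns an empty dictionary, while calling it afterwards returns the loaded entries, which is the dependency the comment points out.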
The test assertions within the MeasureForwardBackwardAreComputedPerBar method have been fixed. The expected values for e0, e1, and e2 have been modified, and the following items were updated:
- Forward Index (e10): Fixed expected values for e0[9], e1[9], and e2[9].
- Backward Index (e11): Fixed expected values for e0[10], e1[10], and e2[10].
- Forward Percent (e16): Fixed expected values for e1[15] and e2[15].
- Backward Percent (e17): Fixed expected values for e0[16] and e1[16].
As a result, the test's expected values now align with the specifications.
…nting everything in the subclasses
Change HTSLabelRenderer: SetUp method to abstract
The SetUp method has been changed from virtual to abstract. As a result, all subclasses are now required to implement SetUp. The original logic within the SetUp method (initialization of phoneDict, language settings, dictionary loading, etc.) has been removed, and these responsibilities are now delegated to the subclasses.
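The effect of moving from virtual to abstract can be sketched with two hypothetical classes. `LabelRendererBase` and `ExampleRenderer` are illustrative names only; the real `HTSLabelRenderer` has more members, and its `SetUp` may take parameters.

```csharp
using System;

// Hypothetical minimal shapes, not the PR's classes.
public abstract class LabelRendererBase {
    // Abstract: there is no default body, so every concrete subclass
    // must provide its own implementation or it will not compile.
    public abstract void SetUp();
}

public class ExampleRenderer : LabelRendererBase {
    public bool IsReady { get; private set; }

    // Work that previously lived in the base class (dictionary loading,
    // language settings, and so on) would now be done here.
    public override void SetUp() {
        IsReady = true;
    }
}
```

The trade-off is that subclasses can no longer inherit shared setup logic for free, but the compiler now enforces that each one initializes itself explicitly.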
The conditions in the IsSyllableVowelExtensionNote method have been expanded so that lyrics starting with specific symbols are recognized as vowel extension notes. In the ProcessPart method, the phonemeDuration calculation has been removed in favor of logic that computes startMs and endMs directly. Phoneme timing now takes headMs and phrase.positionMs into account, and the end time of existing monoLabels is adjusted to prevent overlaps and inconsistencies. Finally, the startMs of the monoLabel at the end of a phrase is now sentenceDurMs - tailMs, so that the timing of the entire phrase is reflected accurately.
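One possible reading of the end-time adjustment is sketched below. `MonoLabel`, `LabelTiming`, and both method names are hypothetical; the commit message does not show the actual code, so the rule "each label ends where the next one starts" is an assumption, not the PR's confirmed behavior.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical label type; the PR's monoLabel structure is not shown.
public class MonoLabel {
    public double StartMs;
    public double EndMs;
    public string Symbol = "";
}

public static class LabelTiming {
    // Assumed rule: make each label end where the next one starts, so
    // labels neither overlap nor leave gaps. The last label's end time
    // is left unchanged.
    public static void AlignEnds(IList<MonoLabel> labels) {
        for (int i = 0; i + 1 < labels.Count; i++) {
            labels[i].EndMs = labels[i + 1].StartMs;
        }
    }

    // Start time of the trailing label, following the description above.
    public static double TrailingStartMs(double sentenceDurMs, double tailMs) {
        return sentenceDurMs - tailMs;
    }
}
```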
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
OpenUtau.Plugin.Builtin/EnunuOnnx/EnunuOnnxPhonemizer.cs:563
- When building the HTS note list, `htsNotes[i]` no longer has `index`, `indexBackwards`, `sentenceDurMs`, or `sentenceDurTicks` populated (these assignments were removed). `HTSNote.e()` uses these fields for multiple contexts, so the dumped labels will contain incorrect/"xx"/potentially negative values (e.g., backward time). Restore initialization of these per-note fields (and consider whether padding notes should be excluded from totals).
var htsPhrase = new HTSPhrase(htsNotes.ToArray());
htsPhrase.totalNotes = htsNotes.Count;
htsPhrase.totalPhonemes = htsPhonemes.Count;
//make neighborhood links between htsNotes and between htsPhonemes
foreach (int i in Enumerable.Range(0, htsNotes.Count)) {
htsNotes[i].parent = htsPhrase;
if (i > 0) {
htsNotes[i].prev = htsNotes[i - 1];
htsNotes[i - 1].next = htsNotes[i];
}
}
Comment on lines +57 to +58
}else if (File.Exists(Path.Join(singer.Location, "enuconfig.yaml"))) {
rootPath = Path.Combine(singer.Location, "enunux");
Comment on lines +62 to +77
//Load g2p from enunux.yaml
//g2p dict should be load after enunu dict
try {
g2p = LoadG2p(rootPath);
} catch (Exception e) {
Log.Error(e, "failed to load g2p dictionary");
return;
}
//Load Dictionary
var enunuDictPath = Path.Join(rootPath, tablePath);
try {
LoadDict(Path.Join(rootPath, tablePath), singer.TextFileEncoding);
} catch (Exception e) {
Log.Error(e, $"failed to load dictionary from {enunuDictPath}");
return;
}
using OpenUtau.Core.Util;
using OpenUtau.Core.Util.nnmnkwii.io.hts;
using Serilog;
using static System.Net.Mime.MediaTypeNames;
Comment on lines +244 to +268
foreach (int i in Enumerable.Range(0, htsPhonemes.Length)) {
htsPhonemes[i].type = GetPhonemeType(htsPhonemes[i].symbol);
htsPhonemes[i].position = i + 1;
htsPhonemes[i].position_backward = htsPhonemes.Length - i;
if (htsPhonemes[i].type.Equals("c")) {
int prev = i - 1;
if (prev >= 0) {
if (htsPhonemes[prev].type.Equals("v")) {
htsPhonemes[i].prev_vowel_distance = 1;
} else {
htsPhonemes[i].prev_vowel_distance = htsPhonemes[prev].prev_vowel_distance + 1;
}
}
}
}
for (int i = htsPhonemes.Length - 1; i > 0; --i) {
if (htsPhonemes[i].type.Equals("c")) {
int next = i + 1;
if (next < htsPhonemes.Length) {
if (htsPhonemes[next].type.Equals("v")) {
htsPhonemes[i].next_vowel_distance = 1;
} else {
htsPhonemes[i].next_vowel_distance = htsPhonemes[next].next_vowel_distance + 1;
}
}
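The two-pass vowel-distance logic in the excerpt above can be restated as a small standalone sketch. `VowelDistance.Compute` is a hypothetical helper, unset distances are assumed to default to 0, and the backward pass here runs down to index 0 (the excerpt stops at `i > 0`) on the assumption that a leading consonant should also receive a distance. None of this is confirmed by the PR.

```csharp
using System;

// Hypothetical helper mirroring the excerpt's logic on plain arrays.
public static class VowelDistance {
    // types[i] is "v" for a vowel and "c" for a consonant, as in the excerpt.
    // Returns per-phoneme distances to the previous and next vowel;
    // entries that are never assigned stay at the assumed default of 0.
    public static (int[] prev, int[] next) Compute(string[] types) {
        int n = types.Length;
        var prev = new int[n];
        var next = new int[n];
        // Forward pass: distance to the preceding vowel.
        for (int i = 0; i < n; i++) {
            if (types[i] == "c" && i > 0) {
                prev[i] = types[i - 1] == "v" ? 1 : prev[i - 1] + 1;
            }
        }
        // Backward pass: distance to the following vowel, including index 0.
        for (int i = n - 1; i >= 0; i--) {
            if (types[i] == "c" && i + 1 < n) {
                next[i] = types[i + 1] == "v" ? 1 : next[i + 1] + 1;
            }
        }
        return (prev, next);
    }
}
```

For the types `c, c, v, c`, the `next` array comes out as 2, 1, 0, 0: the first consonant is two steps from the vowel, which it would not be assigned if the loop skipped index 0.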
…ote determination conditions
In the GetSymbols method of HTSLabelPhonemizer.cs, note.lyric is now passed directly to g2p.Query without being converted to lowercase, so uppercase and lowercase lyrics can be distinguished. In HTSLabelRenderer.cs, an overload of the makeHtsNote method that accepts a single symbol has been added. In the IsSyllableVowelExtensionNote method, the condition for vowel extension notes has been tightened to only +~ or +*.
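The stricter vowel extension check might look roughly like the sketch below. The commit message does not say whether the lyric must equal `+~` or `+*` exactly or merely start with one of them; exact matching is assumed here, and `ExtensionNotes` is a hypothetical container class.

```csharp
using System;

// Hypothetical container; only the method name comes from the commit message.
public static class ExtensionNotes {
    // Assumption: a note counts as a vowel extension only when its lyric
    // is exactly "+~" or "+*". The comparison is case-sensitive and ordinal.
    public static bool IsSyllableVowelExtensionNote(string lyric) {
        return lyric == "+~" || lyric == "+*";
    }
}
```

Under this assumption, a bare `+` or a lyric such as `+a` would no longer be treated as an extension note.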
This pull request refactors and enhances the EnunuOnnx phonemizer and its supporting utilities in OpenUtau. The main changes include moving and reorganizing utility files, improving HTS note and phoneme feature calculations, and enhancing the data passed between rendering components for better extensibility and accuracy.
Refactoring and File Organization:
- […] (`HTSLabelFile.cs`, `Merlin.cs`, `Python.cs`, `Scaler.cs`) from `OpenUtau.Plugin.Builtin/EnunuOnnx/` to `OpenUtau.Core/Util/` and updated namespaces and references to reflect this change. This improves code organization and reuse. [1] [2] [3] [4] [5]
HTS Note and Phoneme Feature Improvements:
- […] `HTSNote` objects to include additional musical and timing information such as time signature, bar/beat position, slur/rest status, key, language, accent, and BPM. This provides richer context for downstream processing. [1] [2]
- […] `HTSPhoneme`, providing more accurate context features for each phoneme.
Render Pipeline Enhancements:
- […] `RenderPhone` and `RenderPhrase` to pass and track the note index for each phoneme, allowing more precise mapping between notes and their phonemes during rendering. [1] [2] [3] [4]
HTS Data Structure Updates:
- […] `HTS.cs` file with more modular and maintainable classes, and updated the way `HTSNote` and `HTSPhoneme` link to their parent phrase and each other, supporting richer relationships and easier traversal. [1] [2]
These changes collectively improve code maintainability, extensibility, and the accuracy of phoneme feature extraction, which is crucial for high-quality singing synthesis.
Refactoring and File Organization
- […] `OpenUtau.Plugin.Builtin/EnunuOnnx/` to `OpenUtau.Core/Util/` and updated namespaces and references throughout the codebase. [1] [2] [3] [4] [5]
HTS Note and Phoneme Feature Improvements
- […] `HTSNote` construction to include time signature, bar/beat position, slur/rest status, key, language, accent, and BPM for each note, improving context for phoneme processing. [1] [2]
- […] `HTSPhoneme`, increasing the accuracy of phoneme context features.
Render Pipeline Enhancements
- […] `RenderPhone` and tracked note indexes in `RenderPhrase` for more accurate mapping between notes and phonemes. [1] [2] [3] [4]
HTS Data Structure Updates
- […] `HTS.cs` file in favor of a more modular approach. [1] [2]
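To illustrate why a per-phoneme note index helps with note-to-phoneme mapping, here is a minimal sketch. `PhoneSketch` and `NoteMapping` are simplified, hypothetical types; the real `RenderPhone` carries many more fields and the PR's field naming may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified phone record with only the fields needed here.
public class PhoneSketch {
    public string Phoneme = "";
    public int NoteIndex;
}

public static class NoteMapping {
    // Groups phoneme symbols by the index of the note they belong to,
    // preserving the order in which the phonemes appear.
    public static Dictionary<int, List<string>> GroupByNote(IEnumerable<PhoneSketch> phones) {
        return phones
            .GroupBy(p => p.NoteIndex)
            .ToDictionary(g => g.Key, g => g.Select(p => p.Phoneme).ToList());
    }
}
```

With the index stored on each phone, the grouping is a direct lookup rather than something inferred from timing overlap between notes and phonemes.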