feat: HF dataset loader, intent expansion, and locale export#20
feat: HF dataset loader, intent expansion, and locale export#20JarbasAl wants to merge 9 commits into
Conversation
Introduce ovos_spec_tools.datasets — a new module for loading, expanding, and exporting OVOS-INTENT-2 templates from HuggingFace datasets: * load_dataset_templates() — load from hass-intent-templates, intents-for-eval, or massive-templates via datasets library * expand_hf_template() — resolve <keyword>, (a|b), [x] into concrete utterances using the existing expansion.py engine * export_to_locale() — write .intent / .voc / .entity files into a standard OVOS-INTENT-2 locale directory tree Add example scripts for the full dataset generation pipeline: * convert_hassil_intents.py — Home Assistant hassil → OVOS locale * export_hf_dataset.py — locale → multi-config HF dataset * generate_entities.py — auto-generate missing .entity files * reexport_recursive.py — recursively resolve nested <keyword> refs * reexport_uniform.py — uniform list<struct> schema for expansions * hf_dataset.py — unified CLI for all three supported datasets Update pyproject.toml with optional 'datasets' dependency. Add 17 tests covering config resolution, normalization, expansion, and locale export.
|
Important Review skippedToo many files! This PR contains 296 files, which is 146 over the limit of 150. To get a review, narrow the scope: ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (296)
You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR adds comprehensive HuggingFace dataset support to ChangesDataset Loading and Export Feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Automated check summary ready. 📊I've aggregated the results of the automated checks for this PR below. 🔍 LintThe automated checks have finished their work. 🏁 ❌ ruff: issues found — see job log 🔨 Build TestsChecking if the code is properly tempered. ⚔️
❌ 3.10: Install OK, tests failed 📊 CoverageIs the code fully immunized with tests? 💉 ✅ 94.0% total coverage Per-file coverage (10 files)
Full report: download the 🔒 Security (pip-audit)Ensuring our password hashing is up to date. 🔨 ✅ No known vulnerabilities found (32 packages scanned). 🏷️ Release PreviewEnsuring our release process remains smooth and efficient. 🚂 Current:
✅ PR title follows conventional commit format. 🚀 Release Channel Compatibility Predicted next version:
📋 Repo HealthScanning for any signs of 'orphaned' code limbs. 🦾 ✅ All required files present. Latest Version: ✅ ⚖️ License CheckI've checked the genealogical tree of your licenses. 🌳 ❌ License violations detected (4 packages) — review required before merging. License distribution: 1× Apache Software License, 1× Apache-2.0 OR BSD-2-Clause, 1× MIT, 1× MIT License Full breakdown — 4 packages
Policy: Apache 2.0 (universal donor). StrongCopyleft / NetworkCopyleft / WeakCopyleft / Other / Error categories fail. MPL allowed. An automated high-five for your latest changes! 🖐️ |
There was a problem hiding this comment.
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pyproject.toml (1)
33-36:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
testextra omitsdatasets.CI runs
test/test_datasets.py::TestExportToLocale::test_round_trip_small, which calls the realload_dataset_templates(...)and triggersImportErrorbecausedatasetsisn't installed in the test environment. The primary fix is making that test hermetic (see the comment intest/test_datasets.py); however, if any test is intended to exercise the real loader,datasetsmust also be added here.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pyproject.toml` around lines 33 - 36, The test failure is caused by the test importing the real datasets package via load_dataset_templates in TestExportToLocale::test_round_trip_small (test/test_datasets.py); either make that test hermetic by mocking/stubbing load_dataset_templates (or patching the datasets import) so it doesn't require the real package, or if the intent is to exercise the real loader, add "datasets" to the test extras in pyproject.toml (the test = [...] list) so the CI environment installs it; update whichever approach you choose and rerun tests.
🧹 Nitpick comments (4)
ovos_spec_tools/datasets.py (1)
284-289: 💤 Low valueDead loop over
<keyword>references.This block iterates
VOC_REmatches but only assignskwand falls through with a comment — it produces no output and has no side effects. Either drop it, or implement the intended fallback (e.g. registering an empty.vocplaceholder so downstream<keyword>refs resolve).♻️ Suggested removal
- # If no expansions but the template has <keyword> refs, - # try to extract inline alternations - if not row.get("expansions"): - for m in VOC_RE.finditer(tpl): - kw = m.group(1) - # No expansion data available; carry on -🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ovos_spec_tools/datasets.py` around lines 284 - 289, The loop over VOC_RE.finditer(tpl) (inside the branch if not row.get("expansions")) is dead — it only assigns kw and does nothing; either remove the loop entirely or implement the intended fallback: for each match extract kw and register an empty expansion placeholder so downstream <keyword> references resolve (e.g., ensure row has an "expansions" mapping and add an entry for kw with an empty list/placeholder). Update the code around VOC_RE, tpl, and row to create row.setdefault("expansions", {})[kw] = [] (or equivalent placeholder structure used by downstream code) so the template fallback works, or delete the whole block if the fallback is not needed.examples/hass-intent-dataset/reexport_recursive.py (1)
29-69: 💤 Low valueUnbounded recursive expansion.
_resolve_valueproduces the full cartesian set of nested<keyword>substitutions with no cap; deeply nested vocabs can blow up memory. Cycles are guarded byseen, but breadth is not. Consider a max-results cap if this runs on large corpora.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/hass-intent-dataset/reexport_recursive.py` around lines 29 - 69, The recursive expansion can explode in breadth; modify _resolve_value and _resolve_expansions to accept a max_results (or limit) parameter and enforce it during recursion: thread the limit through calls to _resolve_value from _resolve_expansions, stop expanding further branches once the accumulating results list reaches the limit (check len(results) during the loop and return early), and propagate that early-return up the recursion so the overall result set never exceeds max_results; also apply the same cap when building flat_values before deduplication to prevent excessive memory use.examples/hass-intent-dataset/generate_entities.py (1)
260-263: ⚡ Quick winDefine
ltstate inline instead of patching after the dict.
lt.stateis declared as a nested list[[...]](line 261) and then corrected by the post-dict block at lines 366-368. Define the correct flat list directly and drop the patch.♻️ Proposed fix
"lt": { - "state": [["įjungta", "išjungta", "atidaryta", "uždaryta", "užrakinta", "atrakinta"]], + "state": ["įjungta", "išjungta", "atidaryta", "uždaryta", "užrakinta", "atrakinta"], "color": ["balta", "juoda", "raudona", "oranžinė", "geltona", "žalia", "mėlyna", "violetinė", "rudа", "rožinė", "turkio"], },And remove the patch block:
-# Fix lt state -if "lt" in _LANG_OVERRIDES: - _LANG_OVERRIDES["lt"]["state"] = ["įjungta", "išjungta", "atidaryta", "uždaryta", "užrakinta", "atrakinta"]Also applies to: 366-368
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/hass-intent-dataset/generate_entities.py` around lines 260 - 263, The lt entry in the languages dict defines "state" as a nested list (e.g. [["įjungta", ...]]) but is later patched to a flat list; change the "lt" -> "state" value to a flat list directly (remove the extra nesting) and delete the subsequent post-dict patch that overwrites lt['state'] (the block that sets lt['state'] after the dict). Update references to "lt" and "state" accordingly so no later correction is needed.examples/hass-intent-dataset/convert_hassil_intents.py (1)
1226-1226: ⚡ Quick winSimplify nested if-block detection.
The condition
"{% if" in branch[len(branch) - len(branch.lstrip()):]is unnecessarily complex. The expressionlen(branch) - len(branch.lstrip())computes the number of leading whitespace characters, and slicing from that position givesbranch.lstrip(). This is equivalent to:"{% if" in branch.lstrip()or simply:
"{% if" in branchsince the presence check doesn't require stripping.
♻️ Proposed simplification
- sub_branches = _split_jinja_if(branch) if "{% if" in branch[len(branch) - len(branch.lstrip()):] else [branch] + sub_branches = _split_jinja_if(branch) if "{% if" in branch else [branch]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/hass-intent-dataset/convert_hassil_intents.py` at line 1226, The conditional that detects a leading Jinja if in the assignment to sub_branches is overly complex; simplify the check by testing the stripped or full string instead of computing leading-whitespace length—replace the expression used in the ternary for sub_branches (which references branch and calls _split_jinja_if(branch)) with a simpler membership test like "{% if" in branch.lstrip() (or "{% if" in branch) so the branch is split via _split_jinja_if(branch) only when appropriate.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/hass-intent-dataset/convert_hassil_intents.py`:
- Line 47: The import of the sys module is unused; remove the top-level "import
sys" statement so the script no longer contains an unused import (look for the
"import sys" line in convert_hassil_intents.py and delete it).
- Line 1759: The print statement currently uses an unnecessary f-string: replace
the f-prefixed call to print (the line printing " --check: intents with <50%
sample survival --") with a plain string literal to avoid misleading formatting
usage; locate the print(...) invocation in convert_hassil_intents.py and remove
the leading "f" so it becomes print(" --check: intents with <50% sample
survival --").
- Line 702: The Danish "da" mapping in the dictionary contains a duplicate key
"abn" (second occurrence at the shown diff); remove the redundant "abn": "open"
entry so the "da" dict only has a single "abn" mapping, ensuring no duplicate
keys remain in the dictionary (locate the "da" dict in convert_hassil_intents.py
to make the edit).
In `@examples/hass-intent-dataset/export_hf_dataset.py`:
- Line 33: Remove the unused import statement "import os" from the top-level
imports in export_hf_dataset.py (the unused symbol is the os import) so the lint
error ruff F401 is resolved; verify there are no references to os elsewhere in
the file and run the linter/CI to confirm the import removal fixes the failing
check.
In `@examples/hass-intent-dataset/generate_entities.py`:
- Around line 360-363: The dict _LANG_OVERRIDES in generate_entities.py contains
a duplicated "pl" key; remove the second "pl" entry (the block with "state" and
"color") so there is only one "pl" definition in _LANG_OVERRIDES, or if the two
differ intentionally, merge their values into the single "pl" entry instead;
ensure no duplicate keys remain to fix the ruff F601 CI failure.
- Line 44: The domain list contains a value with a leading space (" humidifier")
which will produce incorrect vocabulary entries; remove the stray space so it
reads "humidifier" and also defensively trim domain strings before writing
entities (e.g., sanitize the domain list or apply .strip()/trim() to each item)
to prevent any other leading/trailing whitespace from leaking into generated
.entity files.
- Line 460: The import for the standard library module "re" is declared at the
bottom of the module but used earlier (around line 438), causing a lint error
(E402); move the line "import re" up into the top import block with the other
imports so the module-level imports appear before any code or usage, ensuring
the name "re" is available where it's referenced.
- Around line 6-9: The file imports unused symbols causing lint/CI
failures—remove the unused imports json, os, and defaultdict from the top of
generate_entities.py and leave only the required import(s) (e.g., pathlib.Path)
after verifying Path is actually used; update the import line(s) accordingly so
only used modules are imported.
In `@examples/hass-intent-dataset/reexport_uniform.py`:
- Around line 47-54: The code builds expansions from refs returned by
_extract_keyword_refs but preserves duplicates; change the loop that creates
expansions from refs so it deduplicates while preserving order: iterate refs,
keep a local seen set, for each ref skip if already seen, otherwise look up vals
= vocabs.get(ref) and if vals append {"keyword": ref, "values": vals} and add
ref to seen, then assign row["expansions"] only if expansions is non-empty;
update the block around refs/expansions to use this seen-based dedupe.
In `@examples/hf_dataset.py`:
- Line 17: Remove the unused top-level import "sys" from examples/hf_dataset.py:
delete the "import sys" statement (it is unused and ruff F401-failing), leaving
the local "argparse" usage inside main intact; verify there are no other
references to "sys" such as in function main or elsewhere before committing.
In `@ovos_spec_tools/datasets.py`:
- Around line 54-58: The SUPPORTED_DATASETS mapping in
ovos_spec_tools/datasets.py incorrectly maps the key "hassil-intents" to
"OpenVoiceOS/hassil-intents-locale" while the module docstring and docs expect
"OpenVoiceOS/hass-intent-templates"; update the dictionary entry in
SUPPORTED_DATASETS to use "OpenVoiceOS/hass-intent-templates" for
"hassil-intents" and ensure any related references (module docstring or
consumers of SUPPORTED_DATASETS) remain consistent; run or adjust
test_urls_valid if needed to validate the full repo name rather than only the
"OpenVoiceOS/" prefix.
In `@test/test_datasets.py`:
- Around line 204-207: The test test_urls_valid is too permissive—it's only
checking for a slash and the "OpenVoiceOS/" prefix on entries from
SUPPORTED_DATASETS, which allows wrong repo names to slip through; update the
test to assert the exact expected repository names for each key in
SUPPORTED_DATASETS (or assert the set of SUPPORTED_DATASETS.values() equals an
expected set/list of repo strings) so the registry cannot silently drift—modify
test_urls_valid to compare SUPPORTED_DATASETS against the precise expected repo
names.
- Around line 171-190: The test test_round_trip_small currently calls
load_dataset_templates before the patch, causing a real dataset fetch; fix it by
removing the real call and creating a local fixture list (e.g., a small list of
dicts with intent_id and template) to use as rows, then patch
ovos_spec_tools.datasets.load_dataset_templates to return that fixture before
calling export_to_locale; ensure you derive first = fixture[:2] and use
fixture[0]["intent_id"] and fixture[0]["template"] when checking the exported
intent file so the test remains hermetic and does not hit the datasets library.
---
Outside diff comments:
In `@pyproject.toml`:
- Around line 33-36: The test failure is caused by the test importing the real
datasets package via load_dataset_templates in
TestExportToLocale::test_round_trip_small (test/test_datasets.py); either make
that test hermetic by mocking/stubbing load_dataset_templates (or patching the
datasets import) so it doesn't require the real package, or if the intent is to
exercise the real loader, add "datasets" to the test extras in pyproject.toml
(the test = [...] list) so the CI environment installs it; update whichever
approach you choose and rerun tests.
---
Nitpick comments:
In `@examples/hass-intent-dataset/convert_hassil_intents.py`:
- Line 1226: The conditional that detects a leading Jinja if in the assignment
to sub_branches is overly complex; simplify the check by testing the stripped or
full string instead of computing leading-whitespace length—replace the
expression used in the ternary for sub_branches (which references branch and
calls _split_jinja_if(branch)) with a simpler membership test like "{% if" in
branch.lstrip() (or "{% if" in branch) so the branch is split via
_split_jinja_if(branch) only when appropriate.
In `@examples/hass-intent-dataset/generate_entities.py`:
- Around line 260-263: The lt entry in the languages dict defines "state" as a
nested list (e.g. [["įjungta", ...]]) but is later patched to a flat list;
change the "lt" -> "state" value to a flat list directly (remove the extra
nesting) and delete the subsequent post-dict patch that overwrites lt['state']
(the block that sets lt['state'] after the dict). Update references to "lt" and
"state" accordingly so no later correction is needed.
In `@examples/hass-intent-dataset/reexport_recursive.py`:
- Around line 29-69: The recursive expansion can explode in breadth; modify
_resolve_value and _resolve_expansions to accept a max_results (or limit)
parameter and enforce it during recursion: thread the limit through calls to
_resolve_value from _resolve_expansions, stop expanding further branches once
the accumulating results list reaches the limit (check len(results) during the
loop and return early), and propagate that early-return up the recursion so the
overall result set never exceeds max_results; also apply the same cap when
building flat_values before deduplication to prevent excessive memory use.
In `@ovos_spec_tools/datasets.py`:
- Around line 284-289: The loop over VOC_RE.finditer(tpl) (inside the branch if
not row.get("expansions")) is dead — it only assigns kw and does nothing; either
remove the loop entirely or implement the intended fallback: for each match
extract kw and register an empty expansion placeholder so downstream <keyword>
references resolve (e.g., ensure row has an "expansions" mapping and add an
entry for kw with an empty list/placeholder). Update the code around VOC_RE,
tpl, and row to create row.setdefault("expansions", {})[kw] = [] (or equivalent
placeholder structure used by downstream code) so the template fallback works,
or delete the whole block if the fallback is not needed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3b6dcafe-d104-4e1f-b5ff-bae3f224aa8f
📒 Files selected for processing (15)
docs/README.mddocs/api-reference.mddocs/datasets.mdexamples/hass-intent-dataset/APPENDIX.mdexamples/hass-intent-dataset/convert_hassil_intents.pyexamples/hass-intent-dataset/export_hf_dataset.pyexamples/hass-intent-dataset/generate_entities.pyexamples/hass-intent-dataset/reexport_recursive.pyexamples/hass-intent-dataset/reexport_uniform.pyexamples/hf_dataset.pyexamples/locale_to_hf_dataset.pyovos_spec_tools/__init__.pyovos_spec_tools/datasets.pypyproject.tomltest/test_datasets.py
| import hashlib | ||
| import itertools | ||
| import re | ||
| import sys |
There was a problem hiding this comment.
Remove unused import.
The sys module is imported but never referenced in the script.
🧹 Proposed fix
import hashlib
import itertools
import re
-import sys
import unicodedata
from pathlib import Path📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| import sys | |
| import hashlib | |
| import itertools | |
| import re | |
| import unicodedata | |
| from pathlib import Path |
🧰 Tools
🪛 GitHub Actions: Lint / 0_lint _ lint.txt
[error] 47-47: Ruff check failed: F401 sys imported but unused
🪛 GitHub Actions: Lint / lint _ lint
[error] 47-47: Ruff check failed: F401 sys imported but unused
🪛 GitHub Check: lint / lint
[failure] 47-47: ruff (F401)
examples/hass-intent-dataset/convert_hassil_intents.py:47:8: F401 sys imported but unused
help: Remove unused import: sys
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/hass-intent-dataset/convert_hassil_intents.py` at line 47, The
import of the sys module is unused; remove the top-level "import sys" statement
so the script no longer contains an unused import (look for the "import sys"
line in convert_hassil_intents.py and delete it).
| "timer_pause": "timer_pause", "timer_unpause": "timer_unpause", | ||
| "timer": "timer", "timers": "timers", | ||
| "hilsen": "greeting", "mine_data": "my_data", | ||
| "abn": "open", "al": "all", |
There was a problem hiding this comment.
Remove duplicate dictionary key.
The key "abn" already appears at line 658 in the same "da" (Danish) dictionary. Python silently overwrites the first entry with the second. While both map to "open" (so the outcome is the same), the duplication is a code smell and may indicate copy-paste error or future maintenance risk.
🔧 Proposed fix
"timer_add": "timer_add", "timer_decrease": "timer_decrease",
"timer_pause": "timer_pause", "timer_unpause": "timer_unpause",
"timer": "timer", "timers": "timers",
"hilsen": "greeting", "mine_data": "my_data",
- "abn": "open", "al": "all",
+ "al": "all",
},
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "abn": "open", "al": "all", | |
| "al": "all", |
🧰 Tools
🪛 GitHub Check: lint / lint
[failure] 702-702: ruff (F601)
examples/hass-intent-dataset/convert_hassil_intents.py:702:9: F601 Dictionary key literal "abn" repeated
help: Remove repeated key literal "abn"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/hass-intent-dataset/convert_hassil_intents.py` at line 702, The
Danish "da" mapping in the dictionary contains a duplicate key "abn" (second
occurrence at the shown diff); remove the redundant "abn": "open" entry so the
"da" dict only has a single "abn" mapping, ensuring no duplicate keys remain in
the dictionary (locate the "da" dict in convert_hassil_intents.py to make the
edit).
| ] | ||
| thin.sort(key=lambda x: x[2] / x[1]) | ||
| if thin: | ||
| print(f" --check: intents with <50% sample survival --") |
There was a problem hiding this comment.
Remove unnecessary f-string prefix.
The string at line 1759 has no placeholders and doesn't need the f prefix.
🧹 Proposed fix
- print(f" --check: intents with <50% sample survival --")
+ print(" --check: intents with <50% sample survival --")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| print(f" --check: intents with <50% sample survival --") | |
| print(" --check: intents with <50% sample survival --") |
🧰 Tools
🪛 GitHub Check: lint / lint
[failure] 1759-1759: ruff (F541)
examples/hass-intent-dataset/convert_hassil_intents.py:1759:19: F541 f-string without any placeholders
help: Remove extraneous f prefix
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/hass-intent-dataset/convert_hassil_intents.py` at line 1759, The
print statement currently uses an unnecessary f-string: replace the f-prefixed
call to print (the line printing " --check: intents with <50% sample survival
--") with a plain string literal to avoid misleading formatting usage; locate
the print(...) invocation in convert_hassil_intents.py and remove the leading
"f" so it becomes print(" --check: intents with <50% sample survival --").
| from __future__ import annotations | ||
|
|
||
| import json | ||
| import os |
There was a problem hiding this comment.
Remove unused import os.
Flagged by the lint check (ruff F401); CI is failing on it.
🧹 Proposed fix
import json
-import os
import re
import sys📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| import os | |
| import json | |
| import re | |
| import sys |
🧰 Tools
🪛 GitHub Check: lint / lint
[failure] 33-33: ruff (F401)
examples/hass-intent-dataset/export_hf_dataset.py:33:8: F401 os imported but unused
help: Remove unused import: os
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/hass-intent-dataset/export_hf_dataset.py` at line 33, Remove the
unused import statement "import os" from the top-level imports in
export_hf_dataset.py (the unused symbol is the os import) so the lint error ruff
F401 is resolved; verify there are no references to os elsewhere in the file and
run the linter/CI to confirm the import removal fixes the failing check.
| import json | ||
| import os | ||
| from pathlib import Path | ||
| from collections import defaultdict |
There was a problem hiding this comment.
Remove unused imports.
json, os, and collections.defaultdict are unused (ruff F401, CI failing).
🧹 Proposed fix
from __future__ import annotations
-import json
-import os
+import re
from pathlib import Path
-from collections import defaultdict📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| import json | |
| import os | |
| from pathlib import Path | |
| from collections import defaultdict | |
| import re | |
| from pathlib import Path |
🧰 Tools
🪛 GitHub Check: lint / lint
[failure] 9-9: ruff (F401)
examples/hass-intent-dataset/generate_entities.py:9:25: F401 collections.defaultdict imported but unused
help: Remove unused import: collections.defaultdict
[failure] 7-7: ruff (F401)
examples/hass-intent-dataset/generate_entities.py:7:8: F401 os imported but unused
help: Remove unused import: os
[failure] 6-6: ruff (F401)
examples/hass-intent-dataset/generate_entities.py:6:8: F401 json imported but unused
help: Remove unused import: json
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/hass-intent-dataset/generate_entities.py` around lines 6 - 9, The
file imports unused symbols causing lint/CI failures—remove the unused imports
json, os, and defaultdict from the top of generate_entities.py and leave only
the required import(s) (e.g., pathlib.Path) after verifying Path is actually
used; update the import line(s) accordingly so only used modules are imported.
| refs = _extract_keyword_refs(template) | ||
| expansions: list[dict[str, object]] = [] | ||
| for ref in refs: | ||
| vals = vocabs.get(ref) | ||
| if vals: | ||
| expansions.append({"keyword": ref, "values": vals}) | ||
| if expansions: | ||
| row["expansions"] = expansions |
There was a problem hiding this comment.
Duplicate expansions entries for repeated refs.
_extract_keyword_refs returns refs in document order with duplicates, so a template referencing <x> twice emits two identical {keyword, values} entries. De-duplicate while preserving order.
♻️ Proposed fix
- refs = _extract_keyword_refs(template)
+ refs = list(dict.fromkeys(_extract_keyword_refs(template)))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| refs = _extract_keyword_refs(template) | |
| expansions: list[dict[str, object]] = [] | |
| for ref in refs: | |
| vals = vocabs.get(ref) | |
| if vals: | |
| expansions.append({"keyword": ref, "values": vals}) | |
| if expansions: | |
| row["expansions"] = expansions | |
| refs = list(dict.fromkeys(_extract_keyword_refs(template))) | |
| expansions: list[dict[str, object]] = [] | |
| for ref in refs: | |
| vals = vocabs.get(ref) | |
| if vals: | |
| expansions.append({"keyword": ref, "values": vals}) | |
| if expansions: | |
| row["expansions"] = expansions |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/hass-intent-dataset/reexport_uniform.py` around lines 47 - 54, The
code builds expansions from refs returned by _extract_keyword_refs but preserves
duplicates; change the loop that creates expansions from refs so it deduplicates
while preserving order: iterate refs, keep a local seen set, for each ref skip
if already seen, otherwise look up vals = vocabs.get(ref) and if vals append
{"keyword": ref, "values": vals} and add ref to seen, then assign
row["expansions"] only if expansions is non-empty; update the block around
refs/expansions to use this seen-based dedupe.
| """ | ||
| from __future__ import annotations | ||
|
|
||
| import sys |
There was a problem hiding this comment.
Remove unused import sys.
Flagged by ruff F401 (CI failing); argparse is imported locally in main and sys is never used.
🧹 Proposed fix
-import sys
from pathlib import Path📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| import sys | |
| from pathlib import Path |
🧰 Tools
🪛 GitHub Check: lint / lint
[failure] 17-17: ruff (F401)
examples/hf_dataset.py:17:8: F401 sys imported but unused
help: Remove unused import: sys
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/hf_dataset.py` at line 17, Remove the unused top-level import "sys"
from examples/hf_dataset.py: delete the "import sys" statement (it is unused and
ruff F401-failing), leaving the local "argparse" usage inside main intact;
verify there are no other references to "sys" such as in function main or
elsewhere before committing.
| SUPPORTED_DATASETS: dict[str, str] = { | ||
| "hassil-intents": "OpenVoiceOS/hassil-intents-locale", | ||
| "intents-for-eval": "OpenVoiceOS/intents-for-eval", | ||
| "massive-templates": "OpenVoiceOS/massive-templates", | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Confirm which repo name appears across the PR
rg -n "hassil-intents-locale|hass-intent-templates"Repository: OpenVoiceOS/ovos-spec-tools
Length of output: 789
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "test_urls_valid" . || true
rg -n "SUPPORTED_DATASETS" ovos_spec_tools/datasets.py . || true
rg -n "docs/datasets\.md|docs/api-reference\.md|docs/README\.md" ovos_spec_tools/datasets.py . || true
python - <<'PY'
from pathlib import Path
p = Path("ovos_spec_tools/datasets.py")
print("datasets.py first 120 lines:")
print("-----")
print("\n".join(p.read_text(encoding="utf-8").splitlines()[:120]))
PYRepository: OpenVoiceOS/ovos-spec-tools
Length of output: 6162
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '160,260p' test/test_datasets.pyRepository: OpenVoiceOS/ovos-spec-tools
Length of output: 1987
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '120,220p' ovos_spec_tools/datasets.pyRepository: OpenVoiceOS/ovos-spec-tools
Length of output: 3186
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '1,120p' ovos_spec_tools/datasets.py
echo "---- docs/datasets.md ----"
rg -n "hassil-intents|hass-intent-templates|SUPPORTED_DATASETS" docs/datasets.md
echo "---- docs/api-reference.md ----"
rg -n "SUPPORTED_DATASETS|hassil-intents|hass-intent-templates" docs/api-reference.md
echo "---- docs/README.md ----"
rg -n "hassil-intents|hass-intent-templates" docs/README.md
echo "---- docs/datasets.md context ----"
sed -n '1,90p' docs/datasets.md
sed -n '160,210p' docs/datasets.mdRepository: OpenVoiceOS/ovos-spec-tools
Length of output: 9136
Fix hassil-intents HF repo mismatch between registry and docs
ovos_spec_tools/datasets.pymaps"hassil-intents"toOpenVoiceOS/hassil-intents-locale, but the module docstring and the docs (docs/datasets.md,docs/api-reference.md,docs/README.md) all documentOpenVoiceOS/hass-intent-templates(per-language configs +expansionscolumn).test_urls_validonly checksOpenVoiceOS/prefix, so it won’t catch this divergence.
🔧 Proposed fix (docs repo is correct)
SUPPORTED_DATASETS: dict[str, str] = {
- "hassil-intents": "OpenVoiceOS/hassil-intents-locale",
+ "hassil-intents": "OpenVoiceOS/hass-intent-templates",
"intents-for-eval": "OpenVoiceOS/intents-for-eval",
"massive-templates": "OpenVoiceOS/massive-templates",
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ovos_spec_tools/datasets.py` around lines 54 - 58, The SUPPORTED_DATASETS
mapping in ovos_spec_tools/datasets.py incorrectly maps the key "hassil-intents"
to "OpenVoiceOS/hassil-intents-locale" while the module docstring and docs
expect "OpenVoiceOS/hass-intent-templates"; update the dictionary entry in
SUPPORTED_DATASETS to use "OpenVoiceOS/hass-intent-templates" for
"hassil-intents" and ensure any related references (module docstring or
consumers of SUPPORTED_DATASETS) remain consistent; run or adjust
test_urls_valid if needed to validate the full repo name rather than only the
"OpenVoiceOS/" prefix.
| def test_round_trip_small(self): | ||
| """Load a single row, export, verify the intent file has the template.""" | ||
| rows = load_dataset_templates("hassil-intents", lang="en", streaming=False) | ||
| # Only keep first 3 rows to make it fast | ||
| first = [rows[0], rows[1]] | ||
|
|
||
| with tempfile.TemporaryDirectory() as tmp: | ||
| dst = Path(tmp) | ||
| with patch( | ||
| "ovos_spec_tools.datasets.load_dataset_templates", | ||
| return_value=first, | ||
| ): | ||
| count = export_to_locale("hassil-intents", "en", dst) | ||
| assert count == 2 | ||
|
|
||
| # Check that first intent file has rows[0] template | ||
| name = rows[0]["intent_id"].split(":")[-1] | ||
| intent_path = dst / "locale" / "en" / f"{name}.intent" | ||
| assert intent_path.exists() | ||
| assert rows[0]["template"] in intent_path.read_text() |
There was a problem hiding this comment.
test_round_trip_small makes a real, un-mocked dataset call — root cause of the CI failures.
Line 173 calls load_dataset_templates("hassil-intents", lang="en", streaming=False) before the patch, so it actually imports datasets and hits HuggingFace. This fails in CI (ImportError: The datasets library is required) and, even with the dependency installed, would make the unit test network-bound, slow, and flaky. Replace the real fetch with fixture rows so the test stays hermetic.
💚 Suggested hermetic rewrite
- def test_round_trip_small(self):
- """Load a single row, export, verify the intent file has the template."""
- rows = load_dataset_templates("hassil-intents", lang="en", streaming=False)
- # Only keep first 3 rows to make it fast
- first = [rows[0], rows[1]]
-
- with tempfile.TemporaryDirectory() as tmp:
+ def test_round_trip_small(self):
+ """Export fixture rows, verify the intent file has the template."""
+ rows = [
+ {"intent_id": "test:greet", "template": "<hello> {name}",
+ "slots": [], "expansions": [{"keyword": "hello", "values": ["hi"]}]},
+ {"intent_id": "test:bye", "template": "<bye> {name}",
+ "slots": [], "expansions": [{"keyword": "bye", "values": ["goodbye"]}]},
+ ]
+ first = [rows[0], rows[1]]
+
+ with tempfile.TemporaryDirectory() as tmp:
dst = Path(tmp)
with patch(
"ovos_spec_tools.datasets.load_dataset_templates",
return_value=first,
):
count = export_to_locale("hassil-intents", "en", dst)
assert count == 2
# Check that first intent file has rows[0] template
name = rows[0]["intent_id"].split(":")[-1]
intent_path = dst / "locale" / "en" / f"{name}.intent"
assert intent_path.exists()
assert rows[0]["template"] in intent_path.read_text()🧰 Tools
🪛 GitHub Actions: Build Tests / 1_build _ build_tests (3.13).txt
[error] 173-173: Pytest failed: TestExportToLocale.test_round_trip_small could not load HuggingFace dataset because datasets is not installed.
🪛 GitHub Actions: Build Tests / 2_build _ build_tests (3.12).txt
[error] 173-173: Pytest failure in TestExportToLocale.test_round_trip_small when calling load_dataset_templates("hassil-intents", lang="en", streaming=False), which raised ImportError because the datasets library is not installed.
🪛 GitHub Actions: Build Tests / 3_build _ build_tests (3.14).txt
[error] 173-173: Test failure: TestExportToLocale.test_round_trip_small raised ImportError because datasets library is not installed.
🪛 GitHub Actions: Build Tests / 4_build _ build_tests (3.10).txt
[error] 173-173: Pytest failure: TestExportToLocale.test_round_trip_small failed when calling load_dataset_templates('hassil-intents', lang='en', streaming=False). Error: ImportError/ModuleNotFoundError for missing 'datasets' library.
🪛 GitHub Actions: Build Tests / 5_build _ build_tests (3.11).txt
[error] 173-173: Pytest failed: TestExportToLocale.test_round_trip_small. load_dataset_templates('hassil-intents', lang='en', streaming=False) raised ImportError because the 'datasets' library is not installed.
🪛 GitHub Actions: Build Tests / build _ build_tests (3.10)
[error] 173-173: Failed test: TestExportToLocale.test_round_trip_small (load_dataset_templates('hassil-intents', lang='en', streaming=False)) because required dependency 'datasets' is not installed.
🪛 GitHub Actions: Build Tests / build _ build_tests (3.11)
[error] 173-173: Failed test: TestExportToLocale.test_round_trip_small. Error: ImportError: The datasets library is required; install it with: pip install datasets.
🪛 GitHub Actions: Build Tests / build _ build_tests (3.12)
[error] 173-173: TestExportToLocale.test_round_trip_small failed while calling load_dataset_templates('hassil-intents', lang='en', streaming=False).
🪛 GitHub Actions: Build Tests / build _ build_tests (3.13)
[error] 173-173: Failure in TestExportToLocale.test_round_trip_small: load_dataset_templates("hassil-intents", lang="en", streaming=False) raised ImportError because the datasets library is not installed.
🪛 GitHub Actions: Build Tests / build _ build_tests (3.14)
[error] 173-173: Pytest failed: TestExportToLocale.test_round_trip_small. load_dataset_templates('hassil-intents', lang='en', streaming=False) raised ImportError because the datasets library is not installed.
🪛 GitHub Actions: Code Coverage / 0_coverage _ coverage.txt
[error] 173-173: TestExportToLocale.test_round_trip_small failed. Called load_dataset_templates("hassil-intents", lang="en", streaming=False) which raised ImportError because the datasets library is not installed.
🪛 GitHub Actions: Code Coverage / coverage _ coverage
[error] 173-173: Failure occurred when calling load_dataset_templates("hassil-intents", lang="en", streaming=False) due to missing datasets library.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/test_datasets.py` around lines 171 - 190, The test test_round_trip_small
currently calls load_dataset_templates before the patch, causing a real dataset
fetch; fix it by removing the real call and creating a local fixture list (e.g.,
a small list of dicts with intent_id and template) to use as rows, then patch
ovos_spec_tools.datasets.load_dataset_templates to return that fixture before
calling export_to_locale; ensure you derive first = fixture[:2] and use
fixture[0]["intent_id"] and fixture[0]["template"] when checking the exported
intent file so the test remains hermetic and does not hit the datasets library.
| def test_urls_valid(self): | ||
| for url in SUPPORTED_DATASETS.values(): | ||
| assert "/" in url | ||
| assert url.startswith("OpenVoiceOS/") |
There was a problem hiding this comment.
test_urls_valid is too weak to catch a wrong repo.
It only asserts the OpenVoiceOS/ prefix and a /, so it passes for both hassil-intents-locale and hass-intent-templates (see the registry mismatch flagged in datasets.py). Consider asserting the exact expected repo names so the registry can't silently drift from the docs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/test_datasets.py` around lines 204 - 207, The test test_urls_valid is
too permissive—it's only checking for a slash and the "OpenVoiceOS/" prefix on
entries from SUPPORTED_DATASETS, which allows wrong repo names to slip through;
update the test to assert the exact expected repository names for each key in
SUPPORTED_DATASETS (or assert the set of SUPPORTED_DATASETS.values() equals an
expected set/list of repo strings) so the registry cannot silently drift—modify
test_urls_valid to compare SUPPORTED_DATASETS against the precise expected repo
names.
Drop all hardcoded strings from Python scripts — single source of truth is base_locale/<lang>/<slot>.entity files: - generate_base_locale.py: seed file with translation data for 59 languages - base_locale/: 413 .entity files (area, name, color, state, device_class, domain, floor) across 59 languages - convert_hassil_intents.py: read area.entity from base_locale instead of hardcoded COMMON_AREA_NAMES dict - export_hf_dataset.py: slot examples from base_locale/, drop DOMAIN_DEVICE_NAMES and _extract_domain - generate_entities.py: read from base_locale/ instead of _LANG_OVERRIDES and other hardcoded data Numeric slots (brightness, percentage, temperature, ...) are still generated programmatically as they are language-agnostic. HA internal identifiers (device_class, domain) remain English in all languages.
inline_keywords(template, expansions) replaces <keyword> refs with (a|b|c) alternation groups inline — needed for engines like Padatious that don't look up .voc files at runtime. Handles nested refs recursively with configurable max_values cap. Exported from ovos_spec_tools top-level package and documented in the API reference.
…pport" This reverts commit 0a34442.
New module
ovos_spec_tools.datasetsfor loading, expanding, and exporting OVOS-INTENT-2 templates from HuggingFace datasets.What's added
ovos_spec_tools/datasets.pyload_dataset_templates(dataset_id, lang)— loads templates from any of three HF datasets:OpenVoiceOS/hass-intent-templates,intents-for-eval,massive-templatesexpand_hf_template(template, expansions, max_samples)— resolves<keyword>refs,(a|b)alternations,[x]optionals into concrete utterancesexport_to_locale(dataset_id, lang, output_dir)— writes.intent,.voc,.entityfiles to a standard OVOS-INTENT-2 locale treepython -m ovos_spec_tools.datasetsNew example scripts
convert_hassil_intents.pyexport_hf_dataset.pygenerate_entities.py.entityfiles across all languagesreexport_recursive.py<keyword>referencesreexport_uniform.pyhf_dataset.pylocale_to_hf_dataset.pyOther
pyproject.toml: optionaldatasetsextratest/test_datasets.pyAGENTS.md+TODO.mdin-repo docsAPPENDIX.md: formal hassil→OVOS grammar mapping documentationUsage
Verification
OpenVoiceOS/hass-intent-templatesdataset (61 configs, ~224k rows)Summary by CodeRabbit
New Features
Documentation
Tests
Chores
datasetsdependency for advanced dataset features.