AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Fix broken legacy build backend so package installs via uv
- Impact: build-backend changed to setuptools.build_meta; setuptools-scm added to build-system.requires
- Verified via: uv pip install -e .
…ints
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Remove dynamic class creation at runtime; modernise with type hints and docstrings
- Impact: _SimpleNamespace replaces anonymous Entity subclasses for nested data dicts; no behavioural change for flat data keys; full type annotations on Entity, SimpleNER, find_all
- Verified via: uv run pytest test/test_core.py -v (35 passed)
…ERWrapper
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Python 3.10+ type annotations and docstrings for all public classes in rules/annotators
- Impact: RegexNER._create_regex now returns None on re.error and callers skip gracefully; no behavioural changes otherwise
- Verified via: uv run pytest test/test_core.py -v (35 passed)
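The "return None on re.error, callers skip" behaviour can be sketched as follows. This is a minimal illustration, not the actual RegexNER code: `create_regex`, `extract`, and the `re.IGNORECASE` flag are hypothetical stand-ins chosen for the example.

```python
from __future__ import annotations

import re


def create_regex(pattern: str) -> re.Pattern[str] | None:
    """Compile a pattern; return None instead of raising on invalid syntax."""
    try:
        return re.compile(pattern, re.IGNORECASE)  # flag is an assumption
    except re.error:
        return None


def extract(patterns: list[str], text: str) -> list[str]:
    """Collect matches from every valid pattern, skipping invalid ones."""
    matches: list[str] = []
    for raw in patterns:
        rx = create_regex(raw)
        if rx is None:
            # Invalid pattern: skip it rather than abort the whole extraction.
            continue
        matches.extend(m.group(0) for m in rx.finditer(text))
    return matches
```

With this shape, one malformed rule no longer prevents the remaining rules from running.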
…R, NERWrapper
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Provide core test coverage with zero optional dependencies
- Impact: 35 tests cover construction, span lookup, rule extraction, regex extraction, wrapper aggregation, as_json output, and edge cases (invalid regex, partial-word boundary)
- Verified via: uv run pytest test/test_core.py -v (35 passed)
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Document architecture, annotator table, install/usage, and log all Phase 1 changes
- Impact: docs/index.md created with source citations; MAINTENANCE_REPORT.md records all AI actions and test results
- Verified via: file review
- HashtagAnnotator: re.UNICODE flag; \w pattern matches any script (Arabic, Japanese, Chinese, Cyrillic, etc.)
- BaseAnnotator.__init__: added lang="en-us" param; subclasses propagate via super().__init__(lang=lang) instead of self.lang
- SimpleNERIntentTransformer: resolves lang from OVOS session (intent.updated_session → SessionManager → config fallback); _get_pipeline(lang) rebuilds only on language change
- AUDIT.md: documented TECH-009 CurrencyAnnotator char-class bug
- docs/FAQ.md: expanded language support to per-annotator table
- tests: 20 new multilingual/Unicode tests; 203 passing
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
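A small sketch of the script-agnostic hashtag matching mentioned above. The pattern `#\w+` and the helper name are assumptions made for illustration; the real HashtagAnnotator pattern may differ. For `str` patterns in Python 3, `\w` is already Unicode-aware, so `re.UNICODE` here only makes the intent explicit.

```python
import re

# Hypothetical pattern: a '#' followed by one or more word characters.
HASHTAG = re.compile(r"#\w+", re.UNICODE)


def find_hashtags(text: str) -> list[str]:
    """Return every hashtag in the text, regardless of script."""
    return HASHTAG.findall(text)
```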
- numbers_ner/temporal_ner: add compat shim for ovos-number-parser rename convert_words_to_numbers → numbers_to_digits; detect at import time via inspect.signature; map short_scale to Scale enum
- temporal_ner: update all ovos-date-parser calls to new positional lang signature (extract_datetime/duration/nice_date/nice_duration)
- currency_ner: fix TECH-009; R$/A$/C$ multi-char symbols now use regex alternation instead of character class; _parse_currency sorts symbols longest-first; pattern built by _build_pattern() classmethod
- 3 previously skipped temporal tests now pass (206 total, 0 skipped)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
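To illustrate why the TECH-009 fix uses alternation sorted longest-first: a character class such as `[R$A$C$]` can only ever match a single character, so `R$` would be read as a bare `$`. The sketch below is an assumption-laden illustration; the symbol list, the amount sub-pattern, and the function name are hypothetical and not taken from currency_ner.

```python
import re

# Hypothetical symbol list for the example.
SYMBOLS = ["$", "R$", "A$", "C$", "€", "£"]


def build_pattern(symbols: list[str]) -> re.Pattern[str]:
    """Build an alternation with longer symbols first so 'R$' beats '$'."""
    ordered = sorted(symbols, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    # The amount sub-pattern is a simplification for illustration only.
    return re.compile(rf"(?P<symbol>{alternation})\s?(?P<amount>\d+(?:[.,]\d+)?)")


PATTERN = build_pattern(SYMBOLS)
match = PATTERN.search("It costs R$ 25,50 today")
```

Here `match.group("symbol")` is the full two-character symbol rather than the trailing `$` alone.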
- url_ner: extend URL_PATTERN to Unicode label chars (Latin Extended, Cyrillic, CJK, Hiragana, Katakana) + re.UNICODE flag; IDN domains like https://münchen.de now detected
- names_ner: add _STOPWORDS frozenset (~50 entries) filtering common capitalised English non-names (The, Store, Monday, January etc.) to cut false positives at sentence boundaries
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace hardcoded ~50-word set with _load_stopwords_iso() reading stopwords-iso.json directly (bypasses pkg_resources bug on Py3.13). Loads 2590 EN stopwords (lower + Title case) at class definition; graceful fallback to minimal hardcoded set if package unavailable.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… guards
- lookup_ner: use ahocorasick-ner as O(N) backend; regex fallback if absent;
rebuild automaton on add/remove_wordlist; start/end positions in entity data
- temporal_ner: load temporal keywords from res/<lang>/temporal_keywords.txt
instead of hardcoded set; false-positive guard skips diff spans with no
temporal keyword or ordinal (e.g. currency amounts parsed as clock times)
- numbers_ner: skip diffs where replacement is not a pure number (fixes
spurious written_number matches on emails/phone after number normalisation)
- res/en-us/temporal_keywords.txt: 42 English temporal keywords
- res/{de-de,es-es,fr-fr}/temporal_keywords.txt: German, Spanish, French
- pyproject.toml: add ahocorasick-ner>=0.1.1 to dependencies
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t, __all__
- locations_ner: replace O(N_words×N_cities) word-loop with AhocorasickNER automaton; multi-word names (New York, United States, Los Angeles) now detected; _legacy_extract removed
- lookup_ner: drop try/except import and regex fallback; ahocorasick-ner is now a hard dependency; annotate() simplified
- phone_ner: add _EXT suffix pattern for x123 / ext. 456 extensions
- __init__.py: add __all__ = ["Entity", "SimpleNER"]
- SUGGESTIONS.md: 10 tracked improvement proposals
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces exact-span grouping in _deduplicate with a greedy longest-span-wins
algorithm that handles overlaps across different annotators. Ties resolved by
confidence then annotator order. Entities without span info are passed through
unchanged. Adds _resolve_span() helper that prefers data["start"]/data["end"]
over entity.spans for speed.
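The longest-span-wins rule described above can be sketched roughly as follows. This is an illustrative reconstruction from the commit message only: the `Candidate` dataclass, its field names, and `deduplicate` are hypothetical, and the real `_deduplicate` operates on Entity objects.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for an entity with an optional character span."""
    label: str
    start: int | None
    end: int | None  # exclusive
    confidence: float = 1.0
    order: int = 0  # position of the producing annotator in the pipeline


def deduplicate(candidates: list[Candidate]) -> list[Candidate]:
    """Greedy dedup: longest span wins; ties go to confidence, then order."""
    spanless = [c for c in candidates if c.start is None or c.end is None]
    spanned = [c for c in candidates if c.start is not None and c.end is not None]
    ranked = sorted(
        spanned,
        key=lambda c: (-(c.end - c.start), -c.confidence, c.order),
    )
    kept: list[Candidate] = []
    for cand in ranked:
        # Keep a candidate only if it overlaps nothing already accepted.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    # Entities without span info are passed through unchanged.
    return sorted(kept, key=lambda c: c.start) + spanless
```

For "Visit New York", a `York` candidate at 10..14 and a `New York` candidate at 6..14 reduce to `New York` alone, even when they come from different annotators.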
Also adds phone.rx patterns for space-separated international numbers (S-003):
- +44 20 7946 0958 and +33 1 23 45 67 89 now matched
- Pattern 3: \+\d{1,3}(?:[\s-]\d{1,4}){2,6}
- Pattern 4: \+\d{1,3}\s\d{2,5}\s\d{3,4}\s\d{4}
17 new tests in test/test_pipeline_overlap_dedup.py; 323/323 passing.
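Patterns 3 and 4 quoted above can be exercised directly with Python's `re` module. The helper `find_phone` and the order in which the two patterns are tried are assumptions for this sketch; how phone.rx entries are actually combined by PhoneAnnotator is not shown in this commit message.

```python
from __future__ import annotations

import re

# Patterns copied verbatim from the commit message.
PATTERN_3 = re.compile(r"\+\d{1,3}(?:[\s-]\d{1,4}){2,6}")
PATTERN_4 = re.compile(r"\+\d{1,3}\s\d{2,5}\s\d{3,4}\s\d{4}")


def find_phone(text: str) -> str | None:
    """Return the first international number found, or None."""
    # Trying the stricter pattern first is an assumption of this sketch.
    for rx in (PATTERN_4, PATTERN_3):
        m = rx.search(text)
        if m:
            return m.group(0)
    return None
```

The UK example satisfies both patterns, while the French example, whose second group is a single digit, is only covered by Pattern 3.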
AI-Generated Change:
- Model: Claude Sonnet 4.6
- Intent: Fix cross-annotator span collision (S-006) and missing intl phone formats (S-003)
- Impact: _deduplicate now resolves overlaps correctly; PhoneAnnotator matches EU/UK space formats
- Verified via: uv run pytest test/ -q (323 passed)
- Add simple_NER/utils/locale.py: load_rx(), load_intents(), load_wordlist()
- Add locale/en-us/ and locale/de-de/ with .rx, .intent, .txt files
- Wire PhoneAnnotator, CurrencyAnnotator, OrganizationAnnotator, DateAnnotator to locale
- S-002: longest-match-wins dedup in LocationNER (York vs New York)
- S-003: space-separated international phone formats (+44 20 7946 0958)
- S-004: temporal-keyword guard applied to duration extraction
- S-005: per-label confidence in LookUpNER and LocationNER
- S-006: cross-annotator span-overlap dedup in NERPipeline
- S-010: EU decimal notation in CurrencyAnnotator (1.000,50)
- 135 new tests: 206 → 341 passing; numbers_ner 28→84%, lookup 72→91%
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
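For S-010, the EU notation (dot as thousands separator, comma as decimal mark) can be normalised along these lines. The regex and `parse_eu_amount` are hypothetical and written only to illustrate the notation; they are not the CurrencyAnnotator implementation.

```python
import re
from decimal import Decimal

# Either dot-grouped thousands (1.000) or plain digits, optional comma decimals.
EU_AMOUNT = re.compile(r"(?:\d{1,3}(?:\.\d{3})+|\d+)(?:,\d+)?")


def parse_eu_amount(text: str) -> Decimal:
    """Convert an EU-formatted amount such as '1.000,50' to a Decimal."""
    if not EU_AMOUNT.fullmatch(text):
        raise ValueError(f"not an EU-formatted amount: {text!r}")
    # Drop thousands separators, then turn the decimal comma into a point.
    return Decimal(text.replace(".", "").replace(",", "."))
```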
Use difflib.SequenceMatcher to map converted digit spans back to character positions in the original text. NumberNER entities now participate in pipeline cross-annotator overlap dedup (TECH-011).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
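A rough sketch of the span-mapping idea: diff the original text against its number-normalised form and read character offsets off the replaced region. Diffing at token level, and the name `map_number_spans`, are assumptions made for this example; the commit message does not say at what granularity NumberNER runs the matcher.

```python
import re
from difflib import SequenceMatcher


def map_number_spans(original: str, converted: str) -> list[tuple[str, int, int]]:
    """Return (digit_text, start, end) with offsets into the original text."""
    # Tokenise the original while remembering each token's character offsets.
    orig_tokens = [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\S+", original)]
    conv_tokens = converted.split()
    matcher = SequenceMatcher(
        a=[tok for tok, _, _ in orig_tokens], b=conv_tokens, autojunk=False
    )
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue  # only replaced regions correspond to converted numbers
        value = " ".join(conv_tokens[j1:j2])
        start = orig_tokens[i1][1]
        end = orig_tokens[i2 - 1][2]
        spans.append((value, start, end))
    return spans
```

Once an entity carries offsets into the original text, it can be compared against spans from other annotators during overlap dedup.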
- date_ner.py: 73% → 92%; all format branches, full _is_valid_date validation (leap years, month/day bounds, century rules)
- hashtag_ner.py: 79% → 96%; edge cases and classification branches
- pipeline.py: 79% → 92%; Span helpers, select_entity, async dedup
- utils/locale.py: 76% → 99%; malformed regex skip, missing file paths
52 new tests; total 350 → 402 passing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- README.md rewritten: install, quick start, annotator table, dedup strategies, locale/i18n, async, OVOS plugin
- docs/index.md: full API reference with constructor params, data fields per annotator, locale system, BaseAnnotator extension guide
- docs/TUTORIALS.md: 8 end-to-end tutorials with sample output
- examples/01-12: every annotator, dedup strategies, custom keywords, LocationNER label_confidence, TemporalNER anchor_date, multilang currency, async batch, custom annotator subclass, OVOS plugin, LookUpNER runtime wordlists, locale utilities direct usage
- examples/README.md: index with run commands
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
temporal_keywords.txt:
- Add it-it (43 kw), nl-nl (42 kw), pt-pt (43 kw)
locale files per language (es-es, fr-fr, it-it, nl-nl, pt-pt):
- currency.intent: native written currency templates
- currency.rx: EU decimal format (dot-thousands, comma-decimal)
- organization.rx: country-specific legal suffixes + university patterns
- phone.rx: country-specific phone formats + country code variants
locale/de-de:
- phone.rx: +49 and 0xxx German formats
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Rename locale/res dirs to bare language codes (de-de → de, en-us → en)
- locale loader normalises lang via .split('-')[0].lower() — 'de-DE',
'de-de', 'de' all resolve to locale/de/
- Add 16 new languages: da, el, eu, fa, gl, hu, lt, pl, ro, ru, sv,
tr, uk, an, ast, mwl
- Each language gets: date_months.txt, currency.intent, currency.rx,
organization.rx, phone.rx, temporal_keywords.txt
- Custom currency.rx for fa (ریال/تومان), ru (₽), uk (₴), tr (₺)
- Merge regional variants (es-419, nl-be, pt-br, pt-ao, sv-fi) into
primary bare-code dirs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
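The language normalisation described in this commit amounts to the following. The expression `.split('-')[0].lower()` is quoted from the message; the function names and the `locale` root directory default are hypothetical.

```python
from pathlib import Path


def normalise_lang(lang: str) -> str:
    """Reduce a BCP-47 style tag to its bare language code."""
    return lang.split("-")[0].lower()


def locale_dir(lang: str, root: Path = Path("locale")) -> Path:
    """Resolve the resource directory for a language (root is an assumption)."""
    return root / normalise_lang(lang)
```

Regional tags such as `es-419` or `pt-br` reduce to `es` and `pt` under the same rule, which is consistent with merging the regional variants into the primary bare-code directories.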
- Add simple_NER/version.py with OVOS version block (v0.9.0 stable)
- Wire pyproject.toml dynamic version from simple_NER.version.__version__
- Add standard workflows: release_workflow, publish_stable, build-tests, lint, coverage, release-preview, repo-health, license_check, pip_audit, opm-check, conventional-label
- Remove legacy build_tests.yml and license_tests.yml
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ists for 24 languages
AI-Generated Change:
- Model: Claude Sonnet 4.6
- Intent: Close resolved AUDIT.md issue and expand LookUpNER coverage across all supported locales
- Impact:
  - AUDIT.md: TECH-011 moved from Open Issues to Resolved (commit a5bac24, 2026-03-31)
  - Added color/emotion/weather/animal.entity for de, es, fr, it, nl, pt, ca, cs, da, el, eu, fa, gl, hu, lt, mwl, pl, ro, ru, sv, tr, uk, an, ast (96 new files; total entity count: 107)
  - Non-Latin scripts (el, fa, ru, uk) use native script throughout
  - Minority languages (an, ast, mwl, eu, gl, ca) use accurate regional vocabulary
- Verified via: uv run pytest test/ -q → 402 passed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Human review requested!