
Release 0.9.1a1 #8

Open
github-actions[bot] wants to merge 43 commits into master from release-0.9.1a1

Conversation

@github-actions

Human review requested!

JarbasAl and others added 30 commits April 12, 2020 19:46
AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Fix broken legacy build backend so package installs via uv
- Impact: build-backend changed to setuptools.build_meta; setuptools-scm added to build-system.requires
- Verified via: uv pip install -e .

…ints

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Remove dynamic class creation at runtime; modernise with type hints and docstrings
- Impact: _SimpleNamespace replaces anonymous Entity subclasses for nested data dicts; no behavioural change for flat data keys; full type annotations on Entity, SimpleNER, find_all
- Verified via: uv run pytest test/test_core.py -v (35 passed)

…ERWrapper

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Python 3.10+ type annotations and docstrings for all public classes in rules/annotators
- Impact: RegexNER._create_regex now returns None on re.error and callers skip gracefully; no behavioural changes otherwise
- Verified via: uv run pytest test/test_core.py -v (35 passed)
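
The `re.error` handling described in this commit could look roughly like the sketch below. `create_regex` and `extract` are hypothetical stand-ins written for illustration, not the actual `RegexNER` methods, and the `re.IGNORECASE` flag is an assumption.

```python
import re
from typing import Iterator, Optional


def create_regex(pattern: str) -> Optional[re.Pattern]:
    """Compile a pattern, returning None instead of raising on invalid input."""
    try:
        return re.compile(pattern, re.IGNORECASE)
    except re.error:
        return None


def extract(patterns: list[str], text: str) -> Iterator[str]:
    """Yield matches for every valid pattern; patterns that fail to compile are skipped."""
    for raw in patterns:
        compiled = create_regex(raw)
        if compiled is None:
            continue  # invalid regex: skip gracefully rather than crash
        for match in compiled.finditer(text):
            yield match.group(0)
```

With this shape, one malformed user-supplied pattern no longer prevents the remaining patterns from running.
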
…R, NERWrapper

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Provide core test coverage with zero optional dependencies
- Impact: 35 tests cover construction, span lookup, rule extraction, regex extraction, wrapper aggregation, as_json output, and edge cases (invalid regex, partial-word boundary)
- Verified via: uv run pytest test/test_core.py -v (35 passed)

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Document architecture, annotator table, install/usage, and log all Phase 1 changes
- Impact: docs/index.md created with source citations; MAINTENANCE_REPORT.md records all AI actions and test results
- Verified via: file review

- HashtagAnnotator: re.UNICODE flag; \w pattern matches any script
  (Arabic, Japanese, Chinese, Cyrillic, etc.)
- BaseAnnotator.__init__: added lang="en-us" param; subclasses
  propagate via super().__init__(lang=lang) instead of self.lang
- SimpleNERIntentTransformer: resolves lang from OVOS session
  (intent.updated_session → SessionManager → config fallback);
  _get_pipeline(lang) rebuilds only on language change
- AUDIT.md: documented TECH-009 CurrencyAnnotator char-class bug
- docs/FAQ.md: expanded language support to per-annotator table
- tests: 20 new multilingual/Unicode tests; 203 passing
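
A rough illustration of the Unicode hashtag matching listed above. This is a sketch only: `HASHTAG` and `find_hashtags` are hypothetical names, not the `HashtagAnnotator` code. For `str` patterns Python 3 already treats `\w` as Unicode-aware, so the explicit `re.UNICODE` flag mostly documents intent.

```python
import re

# \w matches letters, digits and underscore in any script for str patterns.
HASHTAG = re.compile(r"#(\w+)", re.UNICODE)


def find_hashtags(text: str) -> list[str]:
    """Return hashtag bodies (without the leading '#') in order of appearance."""
    return HASHTAG.findall(text)
```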

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- numbers_ner/temporal_ner: add compat shim for ovos-number-parser
  rename convert_words_to_numbers → numbers_to_digits; detect at
  import time via inspect.signature; map short_scale to Scale enum
- temporal_ner: update all ovos-date-parser calls to new positional
  lang signature (extract_datetime/duration/nice_date/nice_duration)
- currency_ner: fix TECH-009 — R$/A$/C$ multi-char symbols now use
  regex alternation instead of character class; _parse_currency sorts
  symbols longest-first; pattern built by _build_pattern() classmethod
- 3 previously skipped temporal tests now pass (206 total, 0 skipped)
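
To show why a character class cannot represent multi-character symbols such as `R$`, here is a minimal sketch of the alternation approach. The symbol list, the amount sub-pattern, and the names `build_pattern` and `parse_currency` are assumptions for illustration and do not reproduce `CurrencyAnnotator`.

```python
import re

SYMBOLS = ["$", "€", "R$", "A$", "C$"]  # illustrative subset, not the real table

# A character class matches exactly ONE character, so "R$" can never match whole.
BROKEN = re.compile(r"[R$A$C$]")


def build_pattern(symbols: list[str]) -> re.Pattern:
    """Build an alternation of escaped symbols, longest first."""
    ordered = sorted(symbols, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    return re.compile(rf"(?P<symbol>{alternation})\s?(?P<amount>\d+(?:\.\d+)?)")


PATTERN = build_pattern(SYMBOLS)


def parse_currency(text: str) -> list[tuple[str, float]]:
    """Return (symbol, amount) pairs found in the text."""
    return [(m.group("symbol"), float(m.group("amount"))) for m in PATTERN.finditer(text)]
```

Sorting longest-first is defensive here: it matters whenever one symbol is a prefix of another, because regex alternation takes the first alternative that matches.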

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- url_ner: extend URL_PATTERN to Unicode label chars (Latin Extended,
  Cyrillic, CJK, Hiragana, Katakana) + re.UNICODE flag; IDN domains
  like https://münchen.de now detected
- names_ner: add _STOPWORDS frozenset (~50 entries) filtering common
  capitalised English non-names (The, Store, Monday, January etc.)
  to cut false positives at sentence boundaries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace hardcoded ~50-word set with _load_stopwords_iso() reading
stopwords-iso.json directly (bypasses pkg_resources bug on Py3.13).
Loads 2590 EN stopwords (lower + Title case) at class definition;
graceful fallback to minimal hardcoded set if package unavailable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… guards

- lookup_ner: use ahocorasick-ner as O(N) backend; regex fallback if absent;
  rebuild automaton on add/remove_wordlist; start/end positions in entity data
- temporal_ner: load temporal keywords from res/<lang>/temporal_keywords.txt
  instead of hardcoded set; false-positive guard skips diff spans with no
  temporal keyword or ordinal (e.g. currency amounts parsed as clock times)
- numbers_ner: skip diffs where replacement is not a pure number (fixes
  spurious written_number matches on emails/phone after number normalisation)
- res/en-us/temporal_keywords.txt: 42 English temporal keywords
- res/{de-de,es-es,fr-fr}/temporal_keywords.txt: German, Spanish, French
- pyproject.toml: add ahocorasick-ner>=0.1.1 to dependencies

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

JarbasAl and others added 13 commits March 31, 2026 01:17
…t, __all__

- locations_ner: replace O(N_words×N_cities) word-loop with AhocorasickNER
  automaton; multi-word names (New York, United States, Los Angeles) now
  detected; _legacy_extract removed
- lookup_ner: drop try/except import and regex fallback; ahocorasick-ner
  is now a hard dependency; annotate() simplified
- phone_ner: add _EXT suffix pattern for x123 / ext. 456 extensions
- __init__.py: add __all__ = ["Entity", "SimpleNER"]
- SUGGESTIONS.md: 10 tracked improvement proposals

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replaces exact-span grouping in _deduplicate with a greedy longest-span-wins
algorithm that handles overlaps across different annotators. Ties resolved by
confidence then annotator order. Entities without span info are passed through
unchanged. Adds _resolve_span() helper that prefers data["start"]/data["end"]
over entity.spans for speed.
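
The greedy longest-span-wins strategy described above can be sketched as follows. The `Ent` dataclass and `deduplicate` function are hypothetical simplifications, not the pipeline's `_deduplicate` or its real `Entity` type. Spans are assumed to be half-open `[start, end)` character offsets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ent:
    value: str
    start: Optional[int] = None
    end: Optional[int] = None
    confidence: float = 1.0
    order: int = 0  # index of the annotator that produced the entity


def overlaps(a: Ent, b: Ent) -> bool:
    """True when two half-open spans share at least one character."""
    return a.start < b.end and b.start < a.end


def deduplicate(entities: list[Ent]) -> list[Ent]:
    """Keep the longest non-overlapping spans; ties go to confidence, then annotator order."""
    spanless = [e for e in entities if e.start is None or e.end is None]
    spanned = [e for e in entities if e.start is not None and e.end is not None]
    ranked = sorted(spanned, key=lambda e: (-(e.end - e.start), -e.confidence, e.order))
    kept: list[Ent] = []
    for cand in ranked:
        if not any(overlaps(cand, k) for k in kept):
            kept.append(cand)
    kept.sort(key=lambda e: e.start)
    return kept + spanless  # entities without span info pass through unchanged
```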

Also adds phone.rx patterns for space-separated international numbers (S-003):
- +44 20 7946 0958 and +33 1 23 45 67 89 now matched
- Pattern 3: \+\d{1,3}(?:[\s-]\d{1,4}){2,6}
- Pattern 4: \+\d{1,3}\s\d{2,5}\s\d{3,4}\s\d{4}

17 new tests in test/test_pipeline_overlap_dedup.py; 323/323 passing.
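
The two patterns quoted above can be exercised directly with Python's `re` module. The constant names and the `find_numbers` helper are hypothetical. Only the regex strings come from the commit message.

```python
import re

# Pattern 3: country code followed by 2 to 6 groups separated by space or hyphen.
PATTERN_3 = re.compile(r"\+\d{1,3}(?:[\s-]\d{1,4}){2,6}")
# Pattern 4: country code followed by exactly three space-separated groups.
PATTERN_4 = re.compile(r"\+\d{1,3}\s\d{2,5}\s\d{3,4}\s\d{4}")


def find_numbers(text: str) -> list[str]:
    """Return every substring matched by pattern 3."""
    return [m.group(0) for m in PATTERN_3.finditer(text)]
```

Pattern 4 matches the UK example in full. The French example has a one-digit group after the country code, so in this sketch it is matched by pattern 3 only.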

AI-Generated Change:
- Model: Claude Sonnet 4.6
- Intent: Fix cross-annotator span collision (S-006) and missing intl phone formats (S-003)
- Impact: _deduplicate now resolves overlaps correctly; PhoneAnnotator matches EU/UK space formats
- Verified via: uv run pytest test/ -q (323 passed)

- Add simple_NER/utils/locale.py: load_rx(), load_intents(), load_wordlist()
- Add locale/en-us/ and locale/de-de/ with .rx, .intent, .txt files
- Wire PhoneAnnotator, CurrencyAnnotator, OrganizationAnnotator, DateAnnotator to locale
- S-002: longest-match-wins dedup in LocationNER (York vs New York)
- S-003: space-separated international phone formats (+44 20 7946 0958)
- S-004: temporal-keyword guard applied to duration extraction
- S-005: per-label confidence in LookUpNER and LocationNER
- S-006: cross-annotator span-overlap dedup in NERPipeline
- S-010: EU decimal notation in CurrencyAnnotator (1.000,50)
- 135 new tests: 206 → 341 passing; numbers_ner 28→84%, lookup 72→91%
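
As a small illustration of the EU decimal notation mentioned under S-010, the sketch below converts dot-thousands, comma-decimal strings to floats. `parse_eu_amount` and its validation regex are assumptions for illustration. The real `CurrencyAnnotator` logic may differ.

```python
import re

# Groups of three digits separated by dots, optional comma-decimal part.
EU_AMOUNT = re.compile(r"\d{1,3}(?:\.\d{3})*(?:,\d+)?")


def parse_eu_amount(raw: str) -> float:
    """Convert an EU-formatted amount such as '1.000,50' to a float."""
    if EU_AMOUNT.fullmatch(raw) is None:
        raise ValueError(f"not an EU-formatted amount: {raw!r}")
    return float(raw.replace(".", "").replace(",", "."))
```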

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Use difflib.SequenceMatcher to map converted digit spans back to
character positions in the original text. NumberNER entities now
participate in pipeline cross-annotator overlap dedup (TECH-011).
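
One way to recover original character positions with `difflib.SequenceMatcher` is sketched below. It diffs whitespace-separated tokens rather than raw characters, which is an assumption made here for robustness. `tokens_with_spans` and `map_replacements` are hypothetical helpers, not the `NumberNER` code.

```python
import re
from difflib import SequenceMatcher


def tokens_with_spans(text: str) -> list[tuple[str, int, int]]:
    """Split on whitespace, keeping each token's character offsets."""
    return [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def map_replacements(original: str, converted: str) -> list[tuple[int, int, str]]:
    """Return (start, end, replacement) for each replaced run; offsets refer to the original."""
    src = tokens_with_spans(original)
    dst = tokens_with_spans(converted)
    matcher = SequenceMatcher(None, [t[0] for t in src], [t[0] for t in dst], autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue  # only runs where words were rewritten, e.g. words to digits
        start, end = src[i1][1], src[i2 - 1][2]
        spans.append((start, end, " ".join(t[0] for t in dst[j1:j2])))
    return spans
```

Once every number entity carries `start` and `end` offsets like these, it can be compared against spans from other annotators during overlap dedup.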

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- date_ner.py: 73% → 92% — all format branches, full _is_valid_date
  validation (leap years, month/day bounds, century rules)
- hashtag_ner.py: 79% → 96% — edge cases and classification branches
- pipeline.py: 79% → 92% — Span helpers, select_entity, async dedup
- utils/locale.py: 76% → 99% — malformed regex skip, missing file paths

52 new tests; total 350 → 402 passing.
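
The checks attributed to `_is_valid_date` above (leap years, month/day bounds, century rules) can be approximated with the standard library. This is a hypothetical stand-in, and reading "century rules" as the Gregorian rule for century leap years is an assumption.

```python
import calendar


def is_valid_date(year: int, month: int, day: int) -> bool:
    """Check month/day bounds, including Gregorian leap-year rules for February."""
    if not 1 <= year <= 9999:
        return False
    if not 1 <= month <= 12:
        return False
    # monthrange returns (weekday of first day, number of days in month).
    return 1 <= day <= calendar.monthrange(year, month)[1]
```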

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- README.md rewritten — install, quick start, annotator table, dedup
  strategies, locale/i18n, async, OVOS plugin
- docs/index.md — full API reference with constructor params, data
  fields per annotator, locale system, BaseAnnotator extension guide
- docs/TUTORIALS.md — 8 end-to-end tutorials with sample output
- examples/01-12: every annotator, dedup strategies, custom keywords,
  LocationNER label_confidence, TemporalNER anchor_date, multilang
  currency, async batch, custom annotator subclass, OVOS plugin,
  LookUpNER runtime wordlists, locale utilities direct usage
- examples/README.md — index with run commands

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

temporal_keywords.txt:
- Add it-it (43 kw), nl-nl (42 kw), pt-pt (43 kw)

locale files per language (es-es, fr-fr, it-it, nl-nl, pt-pt):
- currency.intent — native written currency templates
- currency.rx     — EU decimal format (dot-thousands, comma-decimal)
- organization.rx — country-specific legal suffixes + university patterns
- phone.rx        — country-specific phone formats + country code variants

locale/de-de:
- phone.rx        — +49 and 0xxx German formats

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Rename locale/res dirs to bare language codes (de-de → de, en-us → en)
- locale loader normalises lang via .split('-')[0].lower() — 'de-DE',
  'de-de', 'de' all resolve to locale/de/
- Add 16 new languages: da, el, eu, fa, gl, hu, lt, pl, ro, ru, sv,
  tr, uk, an, ast, mwl
- Each language gets: date_months.txt, currency.intent, currency.rx,
  organization.rx, phone.rx, temporal_keywords.txt
- Custom currency.rx for fa (ریال/تومان), ru (₽), uk (₴), tr (₺)
- Merge regional variants (es-419, nl-be, pt-br, pt-ao, sv-fi) into
  primary bare-code dirs
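
The language normalisation quoted above is small enough to sketch in full. `normalise_lang` and `locale_dir` are hypothetical names, and the `locale/` base path is illustrative. Only the `.split('-')[0].lower()` expression comes from the commit message.

```python
from pathlib import Path


def normalise_lang(lang: str) -> str:
    """Reduce a BCP-47 style tag to its bare, lower-case language code."""
    return lang.split("-")[0].lower()


def locale_dir(base: Path, lang: str) -> Path:
    """Resolve the directory holding resources for the given language tag."""
    return base / normalise_lang(lang)
```

A consequence of this design is that regional variants such as `pt-BR` and `pt-PT` share one directory, consistent with the merge of regional variants noted in the same commit.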

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add simple_NER/version.py with OVOS version block (v0.9.0 stable)
- Wire pyproject.toml dynamic version from simple_NER.version.__version__
- Add standard workflows: release_workflow, publish_stable, build-tests,
  lint, coverage, release-preview, repo-health, license_check, pip_audit,
  opm-check, conventional-label
- Remove legacy build_tests.yml and license_tests.yml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ists for 24 languages

AI-Generated Change:
- Model: Claude Sonnet 4.6
- Intent: Close resolved AUDIT.md issue and expand LookUpNER coverage across all supported locales
- Impact:
  - AUDIT.md: TECH-011 moved from Open Issues to Resolved (commit a5bac24, 2026-03-31)
  - Added color/emotion/weather/animal.entity for de, es, fr, it, nl, pt, ca, cs, da, el, eu, fa,
    gl, hu, lt, mwl, pl, ro, ru, sv, tr, uk, an, ast (96 new files; total entity count: 107)
  - Non-Latin scripts (el, fa, ru, uk) use native script throughout
  - Minority languages (an, ast, mwl, eu, gl, ca) use accurate regional vocabulary
- Verified via: uv run pytest test/ -q → 402 passed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>