Release 0.9.1a1 by github-actions[bot] · Pull Request #8 · TigreGotico/simple_NER

github-actions · 2026-03-31T03:06:42Z

Human review requested!

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Fix broken legacy build backend so package installs via uv
- Impact: build-backend changed to setuptools.build_meta; setuptools-scm added to build-system.requires
- Verified via: uv pip install -e .

…ints

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Remove dynamic class creation at runtime; modernise with type hints and docstrings
- Impact: _SimpleNamespace replaces anonymous Entity subclasses for nested data dicts; no behavioural change for flat data keys; full type annotations on Entity, SimpleNER, find_all
- Verified via: uv run pytest test/test_core.py -v (35 passed)

…ERWrapper

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Python 3.10+ type annotations and docstrings for all public classes in rules/annotators
- Impact: RegexNER._create_regex now returns None on re.error and callers skip gracefully; no behavioural changes otherwise
- Verified via: uv run pytest test/test_core.py -v (35 passed)
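The "return None on re.error, callers skip" behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `create_regex` and `extract` are hypothetical stand-ins, and the `re.IGNORECASE` flag is an assumption.

```python
import re
from typing import Optional, Pattern


def create_regex(pattern: str) -> Optional[Pattern[str]]:
    """Compile a pattern, returning None instead of raising on invalid input."""
    try:
        return re.compile(pattern, re.IGNORECASE)  # flag choice is an assumption
    except re.error:
        return None


def extract(text: str, patterns: dict[str, str]) -> list[tuple[str, str]]:
    """Run every valid pattern over the text; silently skip broken ones."""
    found = []
    for label, raw in patterns.items():
        rx = create_regex(raw)
        if rx is None:
            continue  # invalid regex: skip rather than crash the whole run
        found.extend((label, m.group(0)) for m in rx.finditer(text))
    return found
```

With this shape, one malformed user-supplied pattern does not stop the remaining patterns from being applied.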

…R, NERWrapper

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Provide core test coverage with zero optional dependencies
- Impact: 35 tests cover construction, span lookup, rule extraction, regex extraction, wrapper aggregation, as_json output, and edge cases (invalid regex, partial-word boundary)
- Verified via: uv run pytest test/test_core.py -v (35 passed)

AI-Generated Change:
- Model: claude-sonnet-4-6
- Intent: Document architecture, annotator table, install/usage, and log all Phase 1 changes
- Impact: docs/index.md created with source citations; MAINTENANCE_REPORT.md records all AI actions and test results
- Verified via: file review

- HashtagAnnotator: re.UNICODE flag; \w pattern matches any script (Arabic, Japanese, Chinese, Cyrillic, etc.)
- BaseAnnotator.__init__: added lang="en-us" param; subclasses propagate via super().__init__(lang=lang) instead of self.lang
- SimpleNERIntentTransformer: resolves lang from OVOS session (intent.updated_session → SessionManager → config fallback); _get_pipeline(lang) rebuilds only on language change
- AUDIT.md: documented TECH-009 CurrencyAnnotator char-class bug
- docs/FAQ.md: expanded language support to per-annotator table
- tests: 20 new multilingual/Unicode tests; 203 passing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
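A small sketch of the Unicode-aware hashtag matching mentioned in the first bullet. The pattern `#(\w+)` and the function name are assumptions for illustration; the annotator's real pattern may differ. In Python 3, `str` patterns are already Unicode-aware, so passing `re.UNICODE` is explicit rather than behaviour-changing; the `re.ASCII` variant is included only for contrast.

```python
import re

# Assumed pattern; the annotator's real pattern is not shown in this PR page.
HASHTAG = re.compile(r"#(\w+)", re.UNICODE)
HASHTAG_ASCII = re.compile(r"#(\w+)", re.ASCII)  # ASCII-only, for comparison


def find_hashtags(text: str) -> list[str]:
    """Return hashtag bodies, whatever script they are written in."""
    return HASHTAG.findall(text)
```

Here `find_hashtags("#привет #東京 #python")` picks up all three tags, while the `re.ASCII` pattern only sees `python`.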

- numbers_ner/temporal_ner: add compat shim for ovos-number-parser rename convert_words_to_numbers → numbers_to_digits; detect at import time via inspect.signature; map short_scale to Scale enum
- temporal_ner: update all ovos-date-parser calls to new positional lang signature (extract_datetime/duration/nice_date/nice_duration)
- currency_ner: fix TECH-009: R$/A$/C$ multi-char symbols now use regex alternation instead of character class; _parse_currency sorts symbols longest-first; pattern built by _build_pattern() classmethod
- 3 previously skipped temporal tests now pass (206 total, 0 skipped)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
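To illustrate the TECH-009 fix, here is a hedged sketch of why a character class cannot represent multi-character symbols such as R$, and how a longest-first alternation can. The symbol list, the amount sub-pattern, and the function names are illustrative assumptions, not the contents of currency_ner.

```python
import re

SYMBOLS = ["$", "R$", "A$", "C$", "€", "£"]  # illustrative subset


def build_pattern(symbols: list[str]) -> re.Pattern:
    """Build an alternation with the longest symbols first, each escaped."""
    ordered = sorted(symbols, key=len, reverse=True)
    alternation = "|".join(re.escape(s) for s in ordered)
    return re.compile(rf"({alternation})\s?(\d+(?:\.\d+)?)")


PATTERN = build_pattern(SYMBOLS)

# The buggy shape: a character class matches ONE character, so "R$" can only
# ever be seen as "R" or "$", never as a unit.
CHAR_CLASS = re.compile(r"([R$AC€£])\s?(\d+(?:\.\d+)?)")


def find_amounts(text: str) -> list[tuple[str, float]]:
    return [(m.group(1), float(m.group(2))) for m in PATTERN.finditer(text)]
```

On the input `R$50`, the character-class version reports the symbol as `$`, whereas the alternation reports `R$`.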

- url_ner: extend URL_PATTERN to Unicode label chars (Latin Extended, Cyrillic, CJK, Hiragana, Katakana) + re.UNICODE flag; IDN domains like https://münchen.de now detected
- names_ner: add _STOPWORDS frozenset (~50 entries) filtering common capitalised English non-names (The, Store, Monday, January etc.) to cut false positives at sentence boundaries

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace hardcoded ~50-word set with _load_stopwords_iso() reading stopwords-iso.json directly (bypasses pkg_resources bug on Py3.13). Loads 2590 EN stopwords (lower + Title case) at class definition; graceful fallback to minimal hardcoded set if package unavailable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… guards

- lookup_ner: use ahocorasick-ner as O(N) backend; regex fallback if absent; rebuild automaton on add/remove_wordlist; start/end positions in entity data
- temporal_ner: load temporal keywords from res/<lang>/temporal_keywords.txt instead of hardcoded set; false-positive guard skips diff spans with no temporal keyword or ordinal (e.g. currency amounts parsed as clock times)
- numbers_ner: skip diffs where replacement is not a pure number (fixes spurious written_number matches on emails/phone after number normalisation)
- res/en-us/temporal_keywords.txt: 42 English temporal keywords
- res/{de-de,es-es,fr-fr}/temporal_keywords.txt: German, Spanish, French
- pyproject.toml: add ahocorasick-ner>=0.1.1 to dependencies

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…t, __all__

- locations_ner: replace O(N_words×N_cities) word-loop with AhocorasickNER automaton; multi-word names (New York, United States, Los Angeles) now detected; _legacy_extract removed
- lookup_ner: drop try/except import and regex fallback; ahocorasick-ner is now a hard dependency; annotate() simplified
- phone_ner: add _EXT suffix pattern for x123 / ext. 456 extensions
- __init__.py: add __all__ = ["Entity", "SimpleNER"]
- SUGGESTIONS.md: 10 tracked improvement proposals

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replaces exact-span grouping in _deduplicate with a greedy longest-span-wins algorithm that handles overlaps across different annotators. Ties resolved by confidence then annotator order. Entities without span info are passed through unchanged. Adds _resolve_span() helper that prefers data["start"]/data["end"] over entity.spans for speed.

Also adds phone.rx patterns for space-separated international numbers (S-003):
- +44 20 7946 0958 and +33 1 23 45 67 89 now matched
- Pattern 3: \+\d{1,3}(?:[\s-]\d{1,4}){2,6}
- Pattern 4: \+\d{1,3}\s\d{2,5}\s\d{3,4}\s\d{4}

17 new tests in test/test_pipeline_overlap_dedup.py; 323/323 passing.

AI-Generated Change:
- Model: Claude Sonnet 4.6
- Intent: Fix cross-annotator span collision (S-006) and missing intl phone formats (S-003)
- Impact: _deduplicate now resolves overlaps correctly; PhoneAnnotator matches EU/UK space formats
- Verified via: uv run pytest test/ -q (323 passed)
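The greedy longest-span-wins strategy described in this commit might look roughly like the sketch below. It is not the pipeline's actual `_deduplicate`: the `Ent` dataclass and its field names are hypothetical, and the sketch assumes that higher confidence wins a tie and that a lower annotator index means "earlier".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Ent:
    """Hypothetical stand-in for an extracted entity."""
    value: str
    start: Optional[int]
    end: Optional[int]
    confidence: float = 1.0
    order: int = 0  # index of the annotator that produced the entity


def overlaps(a: Ent, b: Ent) -> bool:
    """Half-open interval overlap test on [start, end)."""
    return a.start < b.end and b.start < a.end


def deduplicate(entities: List[Ent]) -> List[Ent]:
    spanned = [e for e in entities if e.start is not None and e.end is not None]
    spanless = [e for e in entities if e.start is None or e.end is None]
    # Longest span first; ties broken by confidence, then annotator order.
    ranked = sorted(
        spanned, key=lambda e: (-(e.end - e.start), -e.confidence, e.order)
    )
    kept: List[Ent] = []
    for cand in ranked:
        # Greedy step: keep a candidate only if it collides with nothing kept.
        if not any(overlaps(cand, k) for k in kept):
            kept.append(cand)
    # Entities without span info are passed through unchanged.
    return sorted(kept, key=lambda e: e.start) + spanless
```

Given "York" at 4-8 and "New York" at 0-8 from two different annotators, only "New York" survives, while a non-overlapping "Monday" and any span-less entity are kept.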

- Add simple_NER/utils/locale.py: load_rx(), load_intents(), load_wordlist()
- Add locale/en-us/ and locale/de-de/ with .rx, .intent, .txt files
- Wire PhoneAnnotator, CurrencyAnnotator, OrganizationAnnotator, DateAnnotator to locale
- S-002: longest-match-wins dedup in LocationNER (York vs New York)
- S-003: space-separated international phone formats (+44 20 7946 0958)
- S-004: temporal-keyword guard applied to duration extraction
- S-005: per-label confidence in LookUpNER and LocationNER
- S-006: cross-annotator span-overlap dedup in NERPipeline
- S-010: EU decimal notation in CurrencyAnnotator (1.000,50)
- 135 new tests: 206 → 341 passing; numbers_ner 28→84%, lookup 72→91%

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
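For S-003, the two phone.rx patterns quoted in the earlier commit message in this PR can be tried directly with Python's `re` module. The snippet below only exercises those two patterns as quoted; the names `PATTERN_3`, `PATTERN_4` and `find_phones` are made up for the example, and how phone.rx is actually loaded is not shown here.

```python
import re

# Patterns copied from the commit message; variable names are hypothetical.
PATTERN_3 = re.compile(r"\+\d{1,3}(?:[\s-]\d{1,4}){2,6}")
PATTERN_4 = re.compile(r"\+\d{1,3}\s\d{2,5}\s\d{3,4}\s\d{4}")


def find_phones(text: str) -> list[str]:
    """Return every substring matched by the more general Pattern 3."""
    return [m.group(0) for m in PATTERN_3.finditer(text)]
```

Pattern 3 accepts both example numbers; Pattern 4 is stricter and fits the UK grouping (`+44 20 7946 0958`) but not the French one, whose first group has a single digit.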

Use difflib.SequenceMatcher to map converted digit spans back to character positions in the original text. NumberNER entities now participate in pipeline cross-annotator overlap dedup (TECH-011).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
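One way to picture the span mapping: diff the original text against the digit-normalised text and read the character offsets off the replaced regions. The sketch below diffs whitespace-separated tokens rather than raw characters, which is an assumption made for robustness; the real NumberNER code may diff differently. `map_number_spans` and `tokens_with_spans` are hypothetical names.

```python
import re
from difflib import SequenceMatcher


def tokens_with_spans(text: str) -> list[tuple[str, int, int]]:
    """Split on whitespace, remembering each token's character offsets."""
    return [(m.group(0), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def map_number_spans(original: str, converted: str) -> list[tuple[str, str, int, int]]:
    """Return (original_text, digits, start, end) for each replaced number."""
    orig = tokens_with_spans(original)
    conv = tokens_with_spans(converted)
    matcher = SequenceMatcher(
        None, [t for t, _, _ in orig], [t for t, _, _ in conv], autojunk=False
    )
    results = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue
        replacement = " ".join(t for t, _, _ in conv[j1:j2])
        # Only keep diffs whose replacement is a pure number.
        if not re.fullmatch(r"\d+(?:\.\d+)?", replacement):
            continue
        start, end = orig[i1][1], orig[i2 - 1][2]
        results.append((original[start:end], replacement, start, end))
    return results
```

Because each result carries `start`/`end` offsets into the original text, number entities can then be compared against spans from other annotators during overlap dedup.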

- date_ner.py: 73% → 92%; all format branches, full _is_valid_date validation (leap years, month/day bounds, century rules)
- hashtag_ner.py: 79% → 96%; edge cases and classification branches
- pipeline.py: 79% → 92%; Span helpers, select_entity, async dedup
- utils/locale.py: 76% → 99%; malformed regex skip, missing file paths

52 new tests; total 350 → 402 passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- README.md rewritten: install, quick start, annotator table, dedup strategies, locale/i18n, async, OVOS plugin
- docs/index.md: full API reference with constructor params, data fields per annotator, locale system, BaseAnnotator extension guide
- docs/TUTORIALS.md: 8 end-to-end tutorials with sample output
- examples/01-12: every annotator, dedup strategies, custom keywords, LocationNER label_confidence, TemporalNER anchor_date, multilang currency, async batch, custom annotator subclass, OVOS plugin, LookUpNER runtime wordlists, locale utilities direct usage
- examples/README.md: index with run commands

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

temporal_keywords.txt:
- Add it-it (43 kw), nl-nl (42 kw), pt-pt (43 kw)

locale files per language (es-es, fr-fr, it-it, nl-nl, pt-pt):
- currency.intent: native written currency templates
- currency.rx: EU decimal format (dot-thousands, comma-decimal)
- organization.rx: country-specific legal suffixes + university patterns
- phone.rx: country-specific phone formats + country code variants

locale/de-de:
- phone.rx: +49 and 0xxx German formats

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Rename locale/res dirs to bare language codes (de-de → de, en-us → en)
- locale loader normalises lang via .split('-')[0].lower(); 'de-DE', 'de-de', 'de' all resolve to locale/de/
- Add 16 new languages: da, el, eu, fa, gl, hu, lt, pl, ro, ru, sv, tr, uk, an, ast, mwl
- Each language gets: date_months.txt, currency.intent, currency.rx, organization.rx, phone.rx, temporal_keywords.txt
- Custom currency.rx for fa (ریال/تومان), ru (₽), uk (₴), tr (₺)
- Merge regional variants (es-419, nl-be, pt-br, pt-ao, sv-fi) into primary bare-code dirs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
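The language normalisation described above is small enough to show in full. The expression `.split('-')[0].lower()` comes from the commit message; the function names and the `locale` base directory are assumptions for illustration.

```python
from pathlib import Path


def normalise_lang(lang: str) -> str:
    """Reduce a BCP-47 style tag to its bare, lower-case language code."""
    return lang.split("-")[0].lower()


def locale_dir(base: Path, lang: str) -> Path:
    """Directory holding the resource files for the given language tag."""
    return base / normalise_lang(lang)
```

Regional tags such as `pt-BR` or `es-419` collapse to `pt` and `es`, which fits the merge of regional variants into the bare-code directories. Note that this expression splits on hyphens only, so an underscore-separated tag like `en_US` would not be reduced.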

- Add simple_NER/version.py with OVOS version block (v0.9.0 stable)
- Wire pyproject.toml dynamic version from simple_NER.version.__version__
- Add standard workflows: release_workflow, publish_stable, build-tests, lint, coverage, release-preview, repo-health, license_check, pip_audit, opm-check, conventional-label
- Remove legacy build_tests.yml and license_tests.yml

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ists for 24 languages

AI-Generated Change:
- Model: Claude Sonnet 4.6
- Intent: Close resolved AUDIT.md issue and expand LookUpNER coverage across all supported locales
- Impact:
  - AUDIT.md: TECH-011 moved from Open Issues to Resolved (commit a5bac24, 2026-03-31)
  - Added color/emotion/weather/animal.entity for de, es, fr, it, nl, pt, ca, cs, da, el, eu, fa, gl, hu, lt, mwl, pl, ro, ru, sv, tr, uk, an, ast (96 new files; total entity count: 107)
  - Non-Latin scripts (el, fa, ru, uk) use native script throughout
  - Minority languages (an, ast, mwl, eu, gl, ca) use accurate regional vocabulary
- Verified via: uv run pytest test/ -q → 402 passed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

JarbasAl and others added 30 commits April 12, 2020 19:46

add NamesNER

a7aca3f

readme

ee9635b

readme

0f66393

readme

74d3085

0.5

5394a6c

reorganize package structure

067a49a

reorganize package structure

f2f80aa

add logger

008cda2

readme

9894c4d

readme

544528e

split datetime and timedelta into different extractors

678f000

replace padaos with simplematch

53d730d

replace lingua_franca with lingua_nostra

f8a76a3

initial quebra_frases integration

1a18a0f

Rake confidence

9010aa5

improve Noun confidence

435890d

lang support

f2edccc

RAKEkeywords>=0.2.0

04d6e1a

quebra_frases

fcd31cb

license tests workflow

430c7a4

JarbasAl and others added 13 commits March 31, 2026 01:17

Add renovate.json (#6)

e2f008f

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

Increment Version to 0.9.1a1

c64bd8a

Update Changelog

3b8dc25

github-actions[bot] wants to merge 43 commits into master from release-0.9.1a1
