
feat: Extend ARPAbet coverage to out-of-vocabulary words via G2P cascade #81

@craigtrim

Description


Background

The current phones_for_word() implementation in pystylometry/prosody/pronouncing.py is a thin wrapper around the CMU Pronouncing Dictionary (~134K entries). For any word not in that dictionary the function returns [], silently dropping the word from all prosodic analysis (syllable counting, stress weighting, beat detection).

This issue captures a design discussion about extending that coverage scientifically.


Why not just switch to IPA?

English-to-IPA converters (e.g. eng-to-ipa) were evaluated. They offer no advantage here because:

  • Most IPA converters use CMU as their backend — same underlying data, different notation.
  • ARPAbet's digit-based stress encoding (AE1, IY0) is easier to parse programmatically than IPA's ˈ prefix marker.
  • The stress alternation detection work already in place (see TestStressVariation in tests/prosody/test_pronouncing_stress.py) works cleanly in ARPAbet.

The notation is not the gap. The gap is coverage.
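To make the parsing point concrete, here is a minimal sketch of stress extraction from an ARPAbet transcription. Because ARPAbet appends a stress digit to each vowel phone (1 = primary, 2 = secondary, 0 = unstressed), no diacritic handling is needed; the function name is illustrative, not an existing pystylometry API.

```python
def stress_contour(phones: str) -> list[int]:
    """Return the stress digits of an ARPAbet phone string, in order.

    ARPAbet marks stress as a digit suffix on vowel phones, so the
    contour falls out of a simple character check -- no need to parse
    IPA's prefix stress marker.
    """
    return [int(p[-1]) for p in phones.split() if p[-1].isdigit()]

# "permit" (verb): P ER0 M IH1 T  ->  unstressed, then primary stress
stress_contour("P ER0 M IH1 T")   # -> [0, 1]
```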


The real problem: out-of-vocabulary words

```python
phones_for_word("ChatGPT")  # returns []
```

An author's OOV vocabulary — neologisms, compound coinages, brand names, domain-specific jargon, proper nouns — is silently excluded from prosodic analysis. For a stylometry library this matters because:

  • Author-specific vocabulary is often stylistically significant.
  • OOV words appear disproportionately in the kinds of text stylometry is applied to (academic writing, fiction, marketing copy, technical prose).
  • Measuring the prosodic texture of an author's invented or domain-specific language is a genuine analytical gain over measuring only their common-word usage.

Scientific approach: G2P cascade in ARPAbet

ARPAbet itself is already complete — it covers all ~44 English phonemes. The phoneme inventory does not need extending. What needs extending is the mapping from English orthography to ARPAbet symbols.

This is the Grapheme-to-Phoneme (G2P) problem. A principled cascade keeps all outputs within the ARPAbet framework, preserving comparability with CMU-sourced entries:

Tier 1 — CMU lookup (exact, zero error)

No change from current behaviour.

Tier 2 — Morphological decomposition (separate project — see below)

Split OOV words into known morphemes by inverting a derivation graph:

  • cabbalistical → cabbalistic → cabbalist (CMU lookup succeeds)
  • unhappiness → happiness → happy (CMU lookup succeeds)
  • Handles the allomorphic alternation problem without rules — the inversion encodes surface changes directly.

Tier 3 — Suffix/pattern rules

English stress is strongly conditioned by morphology:

  • -tion / -sion → stress on preceding syllable (Chomsky & Halle, The Sound Pattern of English, 1968)
  • -ic → stress on the syllable immediately preceding the suffix (atom → atomic)
  • Compound nouns → primary stress on first element
  • Acronyms → spell out letter by letter (each letter has a CMU entry)

This tier handles jargon, coinages, and technical terms with high accuracy.

Tier 4 — Neural G2P model (fallback)

For true unknowns. The CMU dict is 134K word→ARPAbet pairs — excellent training data for a seq2seq character→phoneme model.

Candidate library: g2p-en — trained on CMU, outputs ARPAbet, near-drop-in for the pronouncing.py abstraction. Callers would not know which tier handled a given word.
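The four tiers compose into a simple dispatcher. This is a sketch only: the helper names (cmu_lookup, morphological_phones, rule_phones, neural_g2p) are hypothetical stand-ins for Tiers 1 through 4, stubbed here so the dispatch logic itself is runnable. Returning the tier name alongside the phones anticipates the phones_source idea discussed under the implementation path.

```python
def cmu_lookup(word):            # Tier 1: exact CMU dictionary hit (stubbed)
    return {"happy": ["HH AE1 P IY0"]}.get(word.lower())

def morphological_phones(word):  # Tier 2: derivation-graph decomposition (stub)
    return None

def rule_phones(word):           # Tier 3: suffix/pattern rules (stub)
    return None

def neural_g2p(word):            # Tier 4: seq2seq fallback, e.g. g2p-en (stub)
    return ["<predicted>"]

def phones_for_word(word):
    """Try each tier in order; report which tier produced the answer
    so downstream callers can weight confidence accordingly."""
    tiers = [("cmu", cmu_lookup),
             ("morphology", morphological_phones),
             ("rules", rule_phones),
             ("neural", neural_g2p)]
    for source, tier in tiers:
        phones = tier(word)
        if phones:
            return phones, source
    return [], "none"
```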


Tier 2: Extracted to a new standalone project

Tier 2 requires a derivation graph with inverted index — mapping any derived form back to a CMU-lookable ancestor. This is non-trivial enough to warrant its own project and will be consumed by pystylometry as a dependency once built.

Why no existing library does this

Research into existing libraries found no suitable off-the-shelf solution for reversible morpheme boundary detection.

The allomorphic alternation problem makes naive suffix stripping fail:

  • happiness is not happy + ness at the surface — the y became i
  • running is not run + ing — the consonant doubled
  • decision is not decide + sion — the stem changed shape entirely
| Library | Approach | Why it falls short |
|---|---|---|
| morfessor | Unsupervised corpus statistics | Splits statistically, not CMU-lookable. Produces happi, not happiness. |
| spaCy / stanza | Inflectional morphology only | Handles walks → walk, not derivational decomposition. |
| NLTK stemmers | Destructive (Porter, Snowball) | Produce non-words by design. |
| wordsegment | Frequency-based compound splitting | No affix awareness. |
| g2p-en | CMU + neural | Skips the morphological tier entirely (Tier 1 + Tier 4 only). |

Prior art: lemmastem

An existing repo (craigtrim/lemmastem, "Unigram Lexicon segmented by Lemmas and Stems") was evaluated. It contains a derivation graph organised as root→derived with intermediate forms also promoted to top-level keys. The ca bigram file alone has ~23K entries; total data is ~56MB of auto-generated Python.

However, the repo gives no methodological provenance: two commits ("Project Init", "Initialize"), no README content, no documentation, and only # AUTO-GENERATED CONTENT comments in the data files. The source corpus is unknown; the presence of OCR artifacts and unusual character sequences suggests a broad web-crawl unigram list rather than a curated lexical resource.

It is not a suitable foundation. The new project should be built from scratch with documented, academically defensible sources.

Planned source data for the new project

| Source | Role |
|---|---|
| WordNet | Primary derivation structure. Human-curated morphosemantic links (happy → happiness → unhappiness). No noise. |
| BNC | Frequency-ranked British English. Coverage of academic and formal register, the primary domain of stylometry. |
| Google Unigrams | Scale and contemporary coverage. Neologisms, brand names, technical terms that WordNet and BNC miss. |

Synthesis: Use WordNet for derivation graph structure; BNC and Google Unigrams to determine which forms are worth indexing (frequency threshold filters OCR noise and hapax legomena). Build inverted index from result.
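A minimal sketch of that synthesis step: take derivation edges of the kind WordNet supplies (child → parent) and a frequency table of the kind BNC or the Google Unigrams would supply, and index only edges whose derived form clears a frequency threshold. All data below is illustrative, not real corpus counts, and the function name is hypothetical.

```python
# Derivation edges as (derived_form, immediate_parent) pairs.
DERIVATION_EDGES = [
    ("unhappiness", "happiness"),
    ("happiness", "happy"),
    ("happynesse", "happy"),   # OCR-style noise the threshold should drop
]

# Illustrative corpus frequencies (stand-in for BNC / Google Unigrams).
FREQUENCY = {"unhappiness": 1200, "happiness": 98000, "happynesse": 1}

def build_inverted_index(edges, freq, min_count=10):
    """Return derived_form -> immediate_parent, keeping only forms
    frequent enough to be worth indexing (filters OCR noise and
    hapax legomena)."""
    return {child: parent for child, parent in edges
            if freq.get(child, 0) >= min_count}
```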

What the new project should produce

An inverted derivation index: derived_form → immediate_parent for all forms in the graph, plus a root(word) function that chains parent pointers until a CMU-lookable ancestor is found. Consumed by pystylometry's pronouncing.py as a dependency behind phones_for_word().
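The root() chaining described above can be sketched in a few lines. CMU_WORDS and INDEX are toy stand-ins for the real CMU dictionary and the inverted derivation index; the cycle guard matters because an auto-built graph cannot be assumed acyclic.

```python
CMU_WORDS = {"happy", "cabbalist"}       # stand-in for the CMU dictionary
INDEX = {                                 # stand-in for the inverted index
    "unhappiness": "happiness",
    "happiness": "happy",
    "cabbalistical": "cabbalistic",
    "cabbalistic": "cabbalist",
}

def root(word):
    """Chain parent pointers until a CMU-lookable ancestor is found.

    Returns None on a dead end (no known parent) or a cycle.
    """
    seen = set()
    while word not in CMU_WORDS:
        if word in seen or word not in INDEX:
            return None
        seen.add(word)
        word = INDEX[word]
    return word

root("cabbalistical")  # -> "cabbalist"
```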


Why this is scientifically defensible for stylometry

  • All outputs remain in ARPAbet → derived metrics (syllable count, stress position, beat weight) are directly comparable between exact-lookup and G2P-predicted words.
  • No mixed phonological assumptions or notation systems.
  • Stress prediction accuracy is highest for morphologically regular words, which are the majority of author OOV vocabulary.

Honest limitation

Proper nouns — place names, surnames — follow no consistent phonological rule. These are the hardest cases. For stylometry this residual gap is acceptable: proper nouns vary by topic, not author style.


Implementation path (this repo)

The pronouncing.py module is already the right abstraction layer. Once the Tier 2 project exists, the cascade is added behind phones_for_word() — all callers are unaffected. A phones_source field on results could optionally expose which tier was used (useful for downstream confidence weighting).

Suggested module: pystylometry/prosody/g2p_cascade.py, calling the Tier 2 project as a dependency.

This issue is blocked on the new Tier 2 project being built.


Related Issues

References

  • Chomsky, N. & Halle, M. (1968). The Sound Pattern of English. Harper & Row.
  • Liberman, M. & Prince, A. (1977). On stress and linguistic rhythm. Linguistic Inquiry, 8(2), 249–336.
  • Weide, R. L. (1998). The CMU Pronouncing Dictionary, release 0.6d. Carnegie Mellon University.
  • g2p-en: https://github.com/Kyubyong/g2p
  • morfessor: https://github.com/aalto-speech/morfessor
