
feat: Extend ARPAbet coverage to out-of-vocabulary words via G2P cascade #81

@craigtrim

Description


Background

The current phones_for_word() implementation in pystylometry/prosody/pronouncing.py is a thin wrapper around the CMU Pronouncing Dictionary (~134K entries). For any word not in that dictionary the function returns [], silently dropping the word from all prosodic analysis (syllable counting, stress weighting, beat detection).

This issue captures a design discussion about extending that coverage scientifically.


Why not just switch to IPA?

English-to-IPA converters (e.g. eng-to-ipa) were evaluated. They offer no advantage here because:

  • Most IPA converters use CMU as their backend — same underlying data, different notation.
  • ARPAbet's digit-based stress encoding (AE1, IY0) is easier to parse programmatically than IPA's ˈ prefix marker.
  • The stress alternation detection work already in place (see TestStressVariation in tests/prosody/test_pronouncing_stress.py) works cleanly in ARPAbet.

The notation is not the gap. The gap is coverage.
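To make the parsing point concrete, here is a minimal sketch of stress extraction from an ARPAbet transcription. Because ARPAbet appends a stress digit to each vowel phone (1 = primary, 2 = secondary, 0 = unstressed), no diacritic handling is needed; the function name is illustrative, not an existing pystylometry API.

```python
def stress_contour(phones: str) -> list[int]:
    """Return the stress digits of an ARPAbet phone string, in order.

    ARPAbet marks stress as a digit suffix on vowel phones, so the
    contour falls out of a simple character check -- no need to parse
    IPA's prefix stress marker.
    """
    return [int(p[-1]) for p in phones.split() if p[-1].isdigit()]

# "permit" (verb): P ER0 M IH1 T  ->  unstressed, then primary stress
stress_contour("P ER0 M IH1 T")   # -> [0, 1]
```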


The real problem: out-of-vocabulary words

```python
phones_for_word("ChatGPT")  # returns []
```

An author's OOV vocabulary — neologisms, compound coinages, brand names, domain-specific jargon, proper nouns — is silently excluded from prosodic analysis. For a stylometry library this matters because:

  • Author-specific vocabulary is often stylistically significant.
  • OOV words appear disproportionately in the kinds of text stylometry is applied to (academic writing, fiction, marketing copy, technical prose).
  • Measuring the prosodic texture of an author's invented or domain-specific language is a genuine analytical gain over measuring only their common-word usage.

Scientific approach: G2P cascade in ARPAbet

ARPAbet itself is already complete — it covers all ~44 English phonemes. The phoneme inventory does not need extending. What needs extending is the mapping from English orthography to ARPAbet symbols.

This is the Grapheme-to-Phoneme (G2P) problem. A principled cascade keeps all outputs within the ARPAbet framework, preserving comparability with CMU-sourced entries:

Tier 1 — CMU lookup (exact, zero error)

No change from current behaviour.

Tier 2 — Morphological decomposition (separate project — see below)

Split OOV words into known morphemes by inverting a derivation graph:

  • cabbalistical → cabbalistic → cabbalist (CMU lookup succeeds)
  • unhappiness → happiness → happy (CMU lookup succeeds)
  • Handles the allomorphic alternation problem without rules — the inversion encodes surface changes directly.

Tier 3 — Suffix/pattern rules

English stress is strongly conditioned by morphology:

  • -tion / -sion → stress on preceding syllable (Chomsky & Halle, The Sound Pattern of English, 1968)
  • -ic → stress on the syllable immediately preceding the suffix (atom → atomic)
  • Compound nouns → primary stress on first element
  • Acronyms → spell out letter by letter (each letter has a CMU entry)

This tier handles jargon, coinages, and technical terms with high accuracy.

Tier 4 — Neural G2P model (fallback)

For true unknowns. The CMU dict is 134K word→ARPAbet pairs — excellent training data for a seq2seq character→phoneme model.

Candidate library: g2p-en — trained on CMU, outputs ARPAbet, near-drop-in for the pronouncing.py abstraction. Callers would not know which tier handled a given word.
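The four tiers compose into a simple dispatcher. This is a sketch only: the helper names (cmu_lookup, morphological_phones, rule_phones, neural_g2p) are hypothetical stand-ins for Tiers 1 through 4, stubbed here so the dispatch logic itself is runnable. Returning the tier name alongside the phones anticipates the phones_source idea discussed under the implementation path.

```python
def cmu_lookup(word):            # Tier 1: exact CMU dictionary hit (stubbed)
    return {"happy": ["HH AE1 P IY0"]}.get(word.lower())

def morphological_phones(word):  # Tier 2: derivation-graph decomposition (stub)
    return None

def rule_phones(word):           # Tier 3: suffix/pattern rules (stub)
    return None

def neural_g2p(word):            # Tier 4: seq2seq fallback, e.g. g2p-en (stub)
    return ["<predicted>"]

def phones_for_word(word):
    """Try each tier in order; report which tier produced the answer
    so downstream callers can weight confidence accordingly."""
    tiers = [("cmu", cmu_lookup),
             ("morphology", morphological_phones),
             ("rules", rule_phones),
             ("neural", neural_g2p)]
    for source, tier in tiers:
        phones = tier(word)
        if phones:
            return phones, source
    return [], "none"
```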


Tier 2: Extracted to a new standalone project

Tier 2 requires a derivation graph with inverted index — mapping any derived form back to a CMU-lookable ancestor. This is non-trivial enough to warrant its own project and will be consumed by pystylometry as a dependency once built.

Why no existing library does this

Research into existing libraries found no suitable off-the-shelf solution for reversible morpheme boundary detection.

The allomorphic alternation problem makes naive suffix stripping fail:

  • happiness is not happy + ness at the surface — the y became i
  • running is not run + ing — the consonant doubled
  • decision is not decide + sion — the stem changed shape entirely
| Library | Approach | Why it falls short |
|---|---|---|
| morfessor | Unsupervised corpus statistics | Splits statistically, not CMU-lookable. Produces happi, not happiness. |
| spaCy / stanza | Inflectional morphology only | Handles walks → walk, not derivational decomposition. |
| NLTK stemmers | Destructive (Porter, Snowball) | Produce non-words by design. |
| wordsegment | Frequency-based compound splitting | No affix awareness. |
| g2p-en | CMU + neural | Skips the morphological tier entirely (Tier 1 + Tier 4 only). |

Prior art: lemmastem

An existing repo (craigtrim/lemmastem, "Unigram Lexicon segmented by Lemmas and Stems") was evaluated. It contains a derivation graph organised as root→derived with intermediate forms also promoted to top-level keys. The ca bigram file alone has ~23K entries; total data is ~56MB of auto-generated Python.

However, the repo gives no methodological provenance: two commits ("Project Init", "Initialize"), no README content, no documentation, and only # AUTO-GENERATED CONTENT comments in the data files. The source corpus is unknown; the presence of OCR artifacts and unusual character sequences suggests a broad web-crawl unigram list rather than a curated lexical resource.

It is not a suitable foundation. The new project should be built from scratch with documented, academically defensible sources.

Planned source data for the new project

| Source | Role |
|---|---|
| WordNet | Primary derivation structure. Human-curated morphosemantic links (happy → happiness → unhappiness). No noise. |
| BNC | Frequency-ranked British English. Coverage of academic and formal register, the primary domain of stylometry. |
| Google Unigrams | Scale and contemporary coverage. Neologisms, brand names, technical terms that WordNet and BNC miss. |

Synthesis: Use WordNet for derivation graph structure; BNC and Google Unigrams to determine which forms are worth indexing (frequency threshold filters OCR noise and hapax legomena). Build inverted index from result.
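A minimal sketch of that synthesis step: take derivation edges of the kind WordNet supplies (child → parent) and a frequency table of the kind BNC or the Google Unigrams would supply, and index only edges whose derived form clears a frequency threshold. All data below is illustrative, not real corpus counts, and the function name is hypothetical.

```python
# Derivation edges as (derived_form, immediate_parent) pairs.
DERIVATION_EDGES = [
    ("unhappiness", "happiness"),
    ("happiness", "happy"),
    ("happynesse", "happy"),   # OCR-style noise the threshold should drop
]

# Illustrative corpus frequencies (stand-in for BNC / Google Unigrams).
FREQUENCY = {"unhappiness": 1200, "happiness": 98000, "happynesse": 1}

def build_inverted_index(edges, freq, min_count=10):
    """Return derived_form -> immediate_parent, keeping only forms
    frequent enough to be worth indexing (filters OCR noise and
    hapax legomena)."""
    return {child: parent for child, parent in edges
            if freq.get(child, 0) >= min_count}
```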

What the new project should produce

An inverted derivation index: derived_form → immediate_parent for all forms in the graph, plus a root(word) function that chains parent pointers until a CMU-lookable ancestor is found. Consumed by pystylometry's pronouncing.py as a dependency behind phones_for_word().
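The root() chaining described above can be sketched in a few lines. CMU_WORDS and INDEX are toy stand-ins for the real CMU dictionary and the inverted derivation index; the cycle guard matters because an auto-built graph cannot be assumed acyclic.

```python
CMU_WORDS = {"happy", "cabbalist"}       # stand-in for the CMU dictionary
INDEX = {                                 # stand-in for the inverted index
    "unhappiness": "happiness",
    "happiness": "happy",
    "cabbalistical": "cabbalistic",
    "cabbalistic": "cabbalist",
}

def root(word):
    """Chain parent pointers until a CMU-lookable ancestor is found.

    Returns None on a dead end (no known parent) or a cycle.
    """
    seen = set()
    while word not in CMU_WORDS:
        if word in seen or word not in INDEX:
            return None
        seen.add(word)
        word = INDEX[word]
    return word

root("cabbalistical")  # -> "cabbalist"
```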


Why this is scientifically defensible for stylometry

  • All outputs remain in ARPAbet → derived metrics (syllable count, stress position, beat weight) are directly comparable between exact-lookup and G2P-predicted words.
  • No mixed phonological assumptions or notation systems.
  • Stress prediction accuracy is highest for morphologically regular words, which are the majority of author OOV vocabulary.

Honest limitation

Proper nouns — place names, surnames — follow no consistent phonological rule. These are the hardest cases. For stylometry this residual gap is acceptable: proper nouns vary by topic, not author style.


Implementation path (this repo)

The pronouncing.py module is already the right abstraction layer. Once the Tier 2 project exists, the cascade is added behind phones_for_word() — all callers are unaffected. A phones_source field on results could optionally expose which tier was used (useful for downstream confidence weighting).

Suggested module: pystylometry/prosody/g2p_cascade.py, calling the Tier 2 project as a dependency.

This issue is blocked on the new Tier 2 project being built.


Related Issues

References

  • Chomsky, N. & Halle, M. (1968). The Sound Pattern of English. Harper & Row.
  • Liberman, M. & Prince, A. (1977). On stress and linguistic rhythm. Linguistic Inquiry, 8(2), 249–336.
  • Weide, R. L. (1998). The CMU Pronouncing Dictionary, release 0.6d. Carnegie Mellon University.
  • g2p-en: https://github.com/Kyubyong/g2p
  • morfessor: https://github.com/aalto-speech/morfessor
