[Provider] SuttaCentralDictionaryProvider — extract Sense-level glosses, not just rawExcerpt #41

@anantham

Context

Per phase-a curation log §10.5; recurs in phases b/c/d. Surfaced 2026-05-11.

`services/providers/suttaCentralDictionary.ts` returns opaque payloads for many lemmas: `Sense.english` is left as `(no sense)` even though `rawExcerpt` is populated. The raw form is fine for the LLM compiler (it can parse the excerpt) and for a future audit UI, but our structured `Sense` glosses currently can't draw on SC.

This is particularly painful for words like `evaṁ`, `Kammāsadhammaṁ`, and `kurūnaṁ`, where DPD lacks a direct entry or the entry is sparse, but PED (the Pali Text Society's Pali-English Dictionary, embedded in SC's payload) almost certainly has rich content.

What to investigate

SC's `/api/dictionary_full/{lemma}` returns a structured response. The current parser surfaces `rawExcerpt` but doesn't extract per-sense glosses. Possible reasons:

  1. The structure varies by entry (PED vs DPD vs Concise PED vs Buddhist Dictionary all merged); parser was conservative
  2. Some entries are HTML-formatted, others are JSON arrays
  3. The lemma normalization (niggahīta) might not be applied to the SC query
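To check reason 3, one option is to query SC with both niggahīta spellings, since sources differ between `ṁ` (dot above, U+1E41) and `ṃ` (dot below, U+1E43). The following is a minimal sketch only: `lemmaQueryCandidates` is a hypothetical helper, not an existing function in the provider, and which spelling the endpoint actually expects is an assumption to verify.

```typescript
// Hypothetical helper (not from the repo): builds the lemma spellings worth
// trying against the SC dictionary endpoint.
const DOT_ABOVE = "\u1E41"; // ṁ
const DOT_BELOW = "\u1E43"; // ṃ

function lemmaQueryCandidates(lemma: string): string[] {
  // NFC folds decomposed input (m + combining dot) into the single code point.
  const base = lemma.normalize("NFC").trim();
  const withDotAbove = base.replace(/\u1E43/g, DOT_ABOVE);
  const withDotBelow = base.replace(/\u1E41/g, DOT_BELOW);
  // De-duplicate: lemmas without a niggahīta yield a single candidate.
  return Array.from(new Set([withDotAbove, withDotBelow]));
}
```

The provider could try each candidate in order and keep the first response that yields glosses. Logging which spelling matched would also confirm or rule out reason 3.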

Acceptance

  • For at least 3 of the 4 MN10 phase-a/b/c/d senses where SC returned `(no sense)`, the provider now returns structured English glosses
  • Provider tests added for the parser improvements
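As a starting point for the parser tests, here is a rough sketch of a gloss extractor that accepts either an HTML string or an array of strings. The `RawDefinition` type and the `extractGlosses` and `stripHtml` names are assumptions made for illustration. The real shape of the `/api/dictionary_full/{lemma}` response still needs to be confirmed against live payloads, and regex tag stripping is deliberately naive.

```typescript
// Hypothetical input shape: the actual SC payload fields may differ.
type RawDefinition = string | string[];

const NO_SENSE = "(no sense)";

// Naive tag stripping; a real parser would likely use a proper HTML parser.
function stripHtml(html: string): string {
  return html
    .replace(/<[^>]*>/g, " ")
    .replace(/&nbsp;/g, " ")
    .replace(/&amp;/g, "&")
    .replace(/\s+/g, " ")
    .trim();
}

function extractGlosses(definition: RawDefinition): string[] {
  const parts = Array.isArray(definition) ? definition : [definition];
  const glosses = parts.map(stripHtml).filter((g) => g.length > 0);
  // Fall back to the existing placeholder only when nothing usable remains.
  return glosses.length > 0 ? glosses : [NO_SENSE];
}
```

A test could feed it a made-up fixture such as `"<dd><i>adv.</i> thus, so</dd>"` and assert that the result is a non-placeholder gloss, then repeat with recorded payloads for the MN10 lemmas.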

Hit count

4/4 phases (universal pattern — every phase has at least one SC `(no sense)` result)
