
Sentence splitting errors and different output compared to NLTK #16

@Scarfmonster

Description

Rust-punkt and NLTK Punkt (with boundary realignment disabled) produce different results from exactly the same model. NLTK Punkt correctly identifies abbreviations and does not split on them, while rust-punkt, with the same model, splits sentences on almost every period.

To test this, I loaded rust-punkt's JSON model into NLTK:

import json
from collections import defaultdict
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

with open(model_path, mode='r', encoding='UTF8') as model_file:
    model = json.load(model_file)

# Transfer the rust-punkt model parameters into NLTK's PunktParameters
params = PunktParameters()
params.sent_starters = set(model['sentence_starters'])
params.abbrev_types = set(model['abbrev_types'])
params.collocations = set(tuple(t) for t in model['collocations'])
params.ortho_context = defaultdict(int, model['ortho_context'])

punkt = PunktSentenceTokenizer(params)
punkt.tokenize(text, realign_boundaries=False)

The output from NLTK Punkt (the sample text is Polish; "np." abbreviates "na przykład", i.e. "for example"):

Choć zapis pól ($X) może kojarzyć się z zapisem określającym zmienne (jak np. w perlu), to jednak określa pola bieżącego rekordu.

While rust-punkt produced:

Choć zapis pól ($X) może kojarzyć się z zapisem określającym zmienne (jak np.
w perlu), to jednak określa pola bieżącego rekordu.
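The effect of the abbreviation entry can be reproduced in NLTK alone, without loading the full model. A minimal sketch (using the sample sentence above): when 'np' is absent from abbrev_types, Punkt's first-pass annotation marks the period after "np." as a sentence break, which is exactly the split rust-punkt produces; registering 'np' keeps the sentence whole.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

text = ('Choć zapis pól ($X) może kojarzyć się z zapisem określającym '
        'zmienne (jak np. w perlu), to jednak określa pola bieżącego rekordu.')

# With empty parameters, "np." is treated as an ordinary period:
# the text splits into two sentences, as in the rust-punkt output.
bare = PunktSentenceTokenizer(PunktParameters())
print(len(bare.tokenize(text, realign_boundaries=False)))

# With 'np' registered as an abbreviation type, the sentence stays whole.
params = PunktParameters()
params.abbrev_types = {'np'}
aware = PunktSentenceTokenizer(params)
print(len(aware.tokenize(text, realign_boundaries=False)))
```

This suggests the abbrev_types set (or how it is consulted) is where the two implementations diverge.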
