rust-punkt and NLTK Punkt (with boundary realignment disabled) produce different results when given exactly the same model. NLTK Punkt correctly identifies abbreviations and doesn't split on them, while rust-punkt, with the same model, splits sentences on almost every period.
To test things, I loaded the JSON model from rust-punkt:
import json
from collections import defaultdict
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

with open(model_path, mode='r', encoding='UTF8') as model_file:
    model = json.load(model_file)

params = PunktParameters()
params.sent_starters = set(model['sentence_starters'])
params.abbrev_types = set(model['abbrev_types'])
params.collocations = set(tuple(t) for t in model['collocations'])
params.ortho_context = defaultdict(int, model['ortho_context'])

punkt = PunktSentenceTokenizer(params)
punkt.tokenize(text, realign_boundaries=False)
The output from NLTK Punkt:
Choć zapis pól ($X) może kojarzyć się z zapisem określającym zmienne (jak np. w perlu), to jednak określa pola bieżącego rekordu.
While rust-punkt produced:
Choć zapis pól ($X) może kojarzyć się z zapisem określającym zmienne (jak np.
w perlu), to jednak określa pola bieżącego rekordu.
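That is, rust-punkt splits right after "np.", the Polish abbreviation for "na przykład" ("for example"), even though the model lists it in abbrev_types. To confirm that abbrev_types alone is enough to suppress that split in NLTK, here is a minimal sketch with a toy parameter set; the Polish sentence is shortened, and the single entry 'np' stands in for the full abbreviation list of the real model:

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

text = "Choć zapis pól może kojarzyć się ze zmiennymi (jak np. w perlu), to jednak określa pola."

# With empty parameters, "np." is not a known abbreviation,
# so Punkt splits after it (like rust-punkt does here).
plain = PunktSentenceTokenizer(PunktParameters())
print(plain.tokenize(text, realign_boundaries=False))   # two sentences

# With 'np' registered in abbrev_types, the period is treated as part
# of the abbreviation and the sentence stays whole (NLTK's behavior
# with the loaded model).
params = PunktParameters()
params.abbrev_types = {'np'}
aware = PunktSentenceTokenizer(params)
print(aware.tokenize(text, realign_boundaries=False))   # one sentence
```

This suggests the discrepancy is not in the model itself but in how rust-punkt consults abbrev_types during annotation.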