
Sentence splitting errors and different output compared to NLTK #16

@Scarfmonster

Description

Rust-punkt and NLTK Punkt (with boundary realignment disabled) produce different results from exactly the same model. NLTK Punkt correctly identifies abbreviations and does not split on them, while rust-punkt, with the same model, splits sentences on almost every period.

To test this, I loaded rust-punkt's JSON model into NLTK:

import json
from collections import defaultdict
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

with open(model_path, mode='r', encoding='UTF8') as model_file:
    model = json.load(model_file)

# Transfer the rust-punkt model parameters into NLTK's PunktParameters
params = PunktParameters()
params.sent_starters = set(model['sentence_starters'])
params.abbrev_types = set(model['abbrev_types'])
params.collocations = set(tuple(t) for t in model['collocations'])
params.ortho_context = defaultdict(int, model['ortho_context'])

punkt = PunktSentenceTokenizer(params)
punkt.tokenize(text, realign_boundaries=False)

The output from NLTK Punkt (the sample text is Polish; "np." abbreviates "na przykład", i.e. "for example"):

Choć zapis pól ($X) może kojarzyć się z zapisem określającym zmienne (jak np. w perlu), to jednak określa pola bieżącego rekordu.

While rust-punkt produced:

Choć zapis pól ($X) może kojarzyć się z zapisem określającym zmienne (jak np.
w perlu), to jednak określa pola bieżącego rekordu.
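The effect of the abbreviation entry can be reproduced in NLTK alone, without loading the full model. A minimal sketch (using the sample sentence above): when 'np' is absent from abbrev_types, Punkt's first-pass annotation marks the period after "np." as a sentence break, which is exactly the split rust-punkt produces; registering 'np' keeps the sentence whole.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

text = ('Choć zapis pól ($X) może kojarzyć się z zapisem określającym '
        'zmienne (jak np. w perlu), to jednak określa pola bieżącego rekordu.')

# With empty parameters, "np." is treated as an ordinary period:
# the text splits into two sentences, as in the rust-punkt output.
bare = PunktSentenceTokenizer(PunktParameters())
print(len(bare.tokenize(text, realign_boundaries=False)))

# With 'np' registered as an abbreviation type, the sentence stays whole.
params = PunktParameters()
params.abbrev_types = {'np'}
aware = PunktSentenceTokenizer(params)
print(len(aware.tokenize(text, realign_boundaries=False)))
```

This suggests the abbrev_types set (or how it is consulted) is where the two implementations diverge.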
