Add Uzbek Stop Words to Stopwords Corpus #253

@comp-linguist

Description

Hello NLTK Team,
I am proposing the addition of a comprehensive Uzbek stop words list to the official NLTK stopwords corpus. This resource is foundational for NLP tasks in Uzbek, a language currently underserved in standard NLP toolkits.

1. Source and Rationale
The stop word list was manually curated from established Uzbek grammatical definitions rather than generated by automated frequency methods. I chose this methodology because prior work (Madatov et al., 2023) showed that automatic detection approaches (such as TF-IDF, unigram, bigram, and collocation-based methods) produced inaccurate lists for Uzbek: they included too many lexical (content) words and omitted essential inflected forms of function words.
My list follows the grammatical categorization defined by Madatov et al. (2021), specifically including base and inflected forms of:

  • Postpositions
  • Determiners
  • Auxiliaries
  • Degree Adverbs
  • Pronouns

Grounding the list in grammatical categories rather than corpus frequency is intended to keep it accurate and robust for preprocessing Uzbek text.

2. Suggested NLTK Details
Suggested NLTK Name: uzbek
Existing Corpus Reader: The list will be provided as a plain text file, accessible via the existing nltk.corpus.stopwords reader, similar to other languages.
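
To make the intended usage concrete, here is a minimal sketch of how a one-word-per-line file named `uzbek` could be loaded and applied. It uses only the standard library so it runs without NLTK. The sample entries are illustrative placeholders (not the proposed list), and the helper names `load_stopwords` and `remove_stopwords` are hypothetical. Once the file is part of nltk_data, the equivalent call would presumably be `nltk.corpus.stopwords.words("uzbek")`, as for other languages.

```python
import tempfile
from pathlib import Path

# Illustrative placeholder entries only; this is NOT the proposed list.
SAMPLE_WORDS = ["uchun", "bilan", "bu", "juda", "men"]


def load_stopwords(path):
    """Read a plain text file with one stop word per line (UTF-8)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Drop tokens whose lowercase form is in the stop word set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


# Write the sample list to a temporary file named "uzbek",
# mirroring the assumed one-file-per-language layout.
with tempfile.TemporaryDirectory() as tmp:
    uzbek_file = Path(tmp) / "uzbek"
    uzbek_file.write_text("\n".join(SAMPLE_WORDS) + "\n", encoding="utf-8")
    uz_stopwords = load_stopwords(uzbek_file)

tokens = "Men bu kitobni juda yaxshi ko'raman".split()
filtered = remove_stopwords(tokens, uz_stopwords)
print(filtered)
```

Matching on the lowercase form is a design choice of this sketch; the file itself would store entries in lowercase.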

3. Licensing and Distribution
I plan to release this list under a Creative Commons ShareAlike license (or another appropriate open license) so that it is freely and openly redistributable. I will confirm the specific license details in the pull request.
I look forward to your feedback and approval to proceed with the pull request for the nltk_data repository.
