Add Uzbek Stop Words to Stopwords Corpus

Hello NLTK Team,
I am proposing the addition of a comprehensive Uzbek stop words list to the official NLTK stopwords corpus. This resource is foundational for NLP tasks in Uzbek, a language currently underserved in standard NLP toolkits.

**1. Source and Rationale**
The stop word list was manually curated from established Uzbek grammatical definitions rather than generated by automated frequency methods. I chose this methodology because prior work ([Madatov et al., 2023](https://www.informatica.si/index.php/informatica/article/view/3788)) showed that automatic detection approaches (such as TF-IDF, unigram, bigram, and collocation) produced an inaccurate list for Uzbek: it included too many lexical words and omitted essential inflected forms of function words.
My list follows the grammatical categorization defined by [Madatov et al. (2021)](https://www.researchgate.net/publication/351109229_STOP_WORDS_IN_UZBEK_LANGUAGE_TEXTS_O'ZBEK_TILI_MATNLARIDAGI_NOMUHIM_SO'ZLAR), specifically including base and inflected forms of:
- Postpositions
- Determiners
- Auxiliaries
- Degree Adverbs
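- Pronouns

Because every entry belongs to one of these grammatical categories and inflected forms are listed explicitly, the list is intended to avoid both problems reported for the automatic methods when preprocessing Uzbek text.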

**2. Suggested NLTK Details**
- Suggested NLTK name: `uzbek`
- Existing corpus reader: the list will be provided as a plain text file, accessible via the existing `nltk.corpus.stopwords` reader, similar to other languages.
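
To illustrate the intended use, here is a minimal sketch of stop word filtering with a plain text word list. It assumes a one-word-per-line, UTF-8 file, which is my understanding of how the existing stopwords files are laid out. The six words in `SAMPLE_STOP_WORDS` are only an illustrative sample, not the proposed list, and the helper names `load_stop_words` and `remove_stop_words` are hypothetical. The sketch uses only the standard library so that it runs without the corpus being installed.

```python
import tempfile
from pathlib import Path

# Illustrative sample only (assumption): a few common Uzbek function words.
# The actual proposed list is larger and will ship in the pull request.
SAMPLE_STOP_WORDS = ["va", "bilan", "uchun", "bu", "juda", "men"]


def load_stop_words(path):
    """Read a one-word-per-line UTF-8 file into a set, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


# Write the sample to a temporary file named "uzbek" to mimic the corpus file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "uzbek"
    path.write_text("\n".join(SAMPLE_STOP_WORDS) + "\n", encoding="utf-8")
    stop_words = load_stop_words(path)

# "I bought this book for my friend", tokenized naively on whitespace.
tokens = "Men bu kitobni do'stim uchun oldim".split()
filtered = remove_stop_words(tokens, stop_words)
print(filtered)  # ['kitobni', "do'stim", 'oldim']
```

If the list is accepted under the suggested name, I would expect the equivalent lookup to be `nltk.corpus.stopwords.words("uzbek")`; that call will not work until the file is part of the corpus.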

**3. Licensing and Distribution**
I plan to release this list under a Creative Commons ShareAlike license (or another appropriate open license) so that it is freely and openly redistributable. I will confirm the specific license details in the pull request.

I look forward to your feedback and approval to proceed with the pull request for the nltk_data repository.

Add Uzbek Stop Words to Stopwords Corpus #253
