Hello NLTK Team,
I am proposing the addition of a comprehensive Uzbek stop words list to the official NLTK stopwords corpus. This resource is foundational for NLP tasks in Uzbek, a language currently underserved in standard NLP toolkits.
1. Source and Rationale
The stop word list was manually curated from established Uzbek grammatical definitions rather than derived by automated frequency methods. I chose this methodology because prior work (Madatov et al., 2023) showed that automatic detection approaches (TF-IDF, unigram, bigram, and collocation methods) produce inaccurate lists for Uzbek: they include too many lexical words and omit essential inflected forms of function words.
My list follows the grammatical categorization defined by Madatov et al. (2021), specifically including base and inflected forms of:
- Postpositions
- Determiners
- Auxiliaries
- Degree Adverbs
- Pronouns
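Grounding the list in grammatical categories rather than corpus frequency is intended to keep lexical words out of the list and to cover the inflected forms that frequency-based methods miss.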
2. Suggested NLTK Details
Suggested NLTK Name: uzbek
Existing Corpus Reader: The list will be provided as a plain text file, accessible via the existing nltk.corpus.stopwords reader, similar to other languages.
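To illustrate the intended file format, here is a minimal sketch that uses only the Python standard library. It assumes the file follows the same layout as the other stopwords files (UTF-8, one word per line, file named after the language). The helper names `load_stopwords` and `remove_stopwords` are hypothetical, and the four words in the sample file are illustrative placeholders only, not the proposed list.

```python
import tempfile
from pathlib import Path


def load_stopwords(path):
    """Read a UTF-8, one-word-per-line file into a set, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Return the tokens whose lowercased form is not in the stop word set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


# Write a tiny sample file named "uzbek" (illustrative entries only),
# then load it back the way a word-list reader would.
with tempfile.TemporaryDirectory() as tmp:
    sample_file = Path(tmp) / "uzbek"
    sample_file.write_text("bilan\nuchun\nbu\njuda\n", encoding="utf-8")
    sample_stopwords = load_stopwords(sample_file)

# "Bu kitob juda qiziqarli" is used here as a pre-tokenized sample sentence.
filtered = remove_stopwords(["Bu", "kitob", "juda", "qiziqarli"], sample_stopwords)
print(filtered)
```

If the list is accepted and the stopwords corpus is downloaded, I would expect `nltk.corpus.stopwords.words("uzbek")` to return the same entries, assuming the file ID matches the suggested name above.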
3. Licensing and Distribution
I plan to release this list under a Creative Commons ShareAlike license (or another appropriate open license) so that it is freely and openly redistributable. I will confirm the specific license in the pull request.
I look forward to your feedback and approval to proceed with the pull request for the nltk_data repository.