Skip to content

Latest commit

 

History

History
69 lines (50 loc) · 2.11 KB

File metadata and controls

69 lines (50 loc) · 2.11 KB

📝 Persian Text Processing with Parsivar

This project demonstrates how to process Persian (Farsi) text using the Parsivar NLP library. It includes text normalization, tokenization, stemming, and spell checking, with additional tools to handle proper display of Persian characters.


🔍 Features

  • Normalization – Cleans and standardizes Persian text.
  • Tokenization – Splits text into sentences and words.
  • Stemming – Converts words to their root forms.
  • Spell Checking – Detects and corrects misspellings in Persian.
  • Display Support – Uses arabic_reshaper and python-bidi to fix RTL display issues.

🧰 Libraries Used

  • parsivar – NLP tools for Persian.
  • arabic_reshaper – For reshaping characters to correct forms.
  • python-bidi – Ensures proper display of RTL scripts like Persian.

📌 How It Works

  1. Read Persian text from a .txt file.
  2. Normalize the text using Parsivar.
  3. Tokenize the normalized text into words and sentences.
  4. Apply stemming to get root forms of words.
  5. Use spell correction on custom input.
  6. Display reshaped output for better readability in terminals.

🚀 Usage

Install dependencies:

pip install parsivar arabic_reshaper python-bidi
pip install -r requirements.txt

then in code first we normalize then we tokenize, and after tokenize we stammer and in the end for spell detection you need to download these two files and put it in the this below path:

first create a spell folder in this path:
venv\Lib\site-packages\parsivar\resource
then replace these two file in the spell folder:
- onegram.pckl
- mybigram_lm.pckl

🎥preview

3Capture

📳technology

  • python
  • nltk
  • parsivar
  • bidi
  • arabic_reshaper