This repository provides a character-level classification model for Sanskrit text: it predicts whether each character in a word is part of a Sandhi point (SP) or not (NSP). The project includes a preprocessing pipeline, a Multi-Layer Perceptron (MLP) training script, a FastAPI server for model interaction, and a Telegram bot for user-friendly predictions. These labels can help reconstruct the original words before Sandhi formation, improve Sanskrit text segmentation, and aid downstream NLP tasks. The model is not yet highly accurate; we are working to improve it and are exploring alternatives such as different tokenizers and RNN-based architectures.
- Character-level classification: Predicts if each character in a Sanskrit word is part of a Sandhi point.
- Preprocessing pipeline: Tokenizes and pads Sanskrit words for input to the model.
- API integration: Provides a FastAPI server for model interaction.
- Telegram bot: Allows users to interact with the model easily by sending Sanskrit words.
- Python 3.x
- Telegram bot token (Create a bot on Telegram via BotFather)
- Install dependencies via pip (detailed below).
- Clone the repository:
git clone https://github.com/your-username/sanskrit-sandhi-prediction.git
cd sanskrit-sandhi-prediction
- Set up a virtual environment (optional but recommended):
python -m venv venv
source venv/bin/activate   # For Linux/Mac
venv\Scripts\activate      # For Windows
- Install dependencies:
pip install -r requirements.txt
- Set up the environment:
- Create a .env file in the root directory of the project
- Add your Telegram bot token in the .env file:
TELEGRAM_BOT_TOKEN=your_bot_token_here
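As a sketch of how the token can be read at runtime, the helper below (not part of the repo; the function name is illustrative) pulls `TELEGRAM_BOT_TOKEN` from the environment. A library such as python-dotenv can load the `.env` file into `os.environ` before it is called.

```python
import os

# Illustrative helper: read the bot token that the .env entry above defines.
# Assumes something (e.g. python-dotenv's load_dotenv) has already loaded
# the .env file into the process environment.
def get_bot_token() -> str:
    token = os.getenv("TELEGRAM_BOT_TOKEN")
    if not token:
        raise RuntimeError("TELEGRAM_BOT_TOKEN is not set; check your .env file")
    return token
```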
- Train the model:
  - Run the following command to train the model:
    python SansSandhi.py
  - This will process the dataset, train the Multi-Layer Perceptron (MLP) model, and save it as sanskrit_model.pkl
- Start the FastAPI server:
  - Start the FastAPI API to serve predictions:
    uvicorn app:app --reload
  - The API will be available at http://127.0.0.1:8000
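Once the server is up, predictions can be requested over HTTP. The snippet below builds a request URL client-side; the `/predict` route and the `word` query parameter are assumptions for illustration — check app.py for the actual path and request shape.

```python
from urllib import parse

# Hypothetical client-side URL builder for the prediction endpoint.
# The "/predict" route and "word" parameter are assumed, not confirmed.
def build_predict_url(word: str, base_url: str = "http://127.0.0.1:8000") -> str:
    return f"{base_url}/predict?{parse.urlencode({'word': word})}"
```

The resulting URL can then be fetched with any HTTP client (curl, requests, urllib) while the uvicorn server is running.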
- Start the bot:
  - Run the bot script:
    python bot.py
  - The bot will be active and respond to messages sent to it
- Interacting with the bot:
  - Send a Sanskrit word to the bot
  - The bot will split the word into characters and predict whether each character is part of a Sandhi point (SP) or not (NSP)
Telegram Bot:
- Send a word like यॊयस्माज्जायते to the bot
- The bot will return the predictions for each character: either SP or NSP
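A minimal sketch of how such a per-character reply could be assembled is shown below; the actual formatting in bot.py may differ. Note that iterating a Python string yields code points, so Devanagari vowel signs and virāmas each count as a separate "character" here.

```python
# Illustrative sketch: pair each character of the word with its predicted
# SP/NSP label to build the bot's reply text.
def format_predictions(word: str, labels: list) -> str:
    assert len(word) == len(labels)  # one label per character
    return "\n".join(f"{ch}: {lab}" for ch, lab in zip(word, labels))
```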
- Data Preprocessing:
- The dataset is preprocessed to tokenize Sanskrit words and split them into sequences of characters
- The tokenizer is trained on a predefined set of characters and used to convert words into sequences of indices
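The two preprocessing steps above can be sketched in pure Python as follows. The vocabulary construction, padding scheme, and index 0 reserved for padding are assumptions for illustration; the repository's tokenizer is trained on its own predefined character set.

```python
PAD_ID = 0  # assumed: index 0 reserved for padding

# Build a character-to-index vocabulary from a list of words.
def build_vocab(words):
    chars = sorted({ch for w in words for ch in w})
    return {ch: i + 1 for i, ch in enumerate(chars)}

# Convert a word to a fixed-length sequence of indices, truncating or
# right-padding with PAD_ID as needed.
def encode(word, vocab, max_len):
    ids = [vocab.get(ch, PAD_ID) for ch in word][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))
```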
- Model Training:
- A Multi-Layer Perceptron (MLP) model is built to classify each character in a word as part of a Sandhi point (SP) or not (NSP)
- The model is trained on tokenized and padded sequences of words
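To make the per-character MLP classification concrete, here is a minimal forward pass for a single character position: one hidden layer with ReLU, then a sigmoid output thresholded into SP/NSP. The shapes and weights are illustrative only; real weights come from training in SansSandhi.py.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

# One dense layer: W is a list of weight rows, b the bias vector.
def dense(x, W, b):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

# Sketch of an MLP deciding SP vs NSP for one encoded character position.
def mlp_predict(x, W1, b1, W2, b2):
    h = relu(dense(x, W1, b1))
    logit = dense(h, W2, b2)[0]
    p_sp = 1.0 / (1.0 + math.exp(-logit))  # sigmoid -> probability of SP
    return "SP" if p_sp >= 0.5 else "NSP"
```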
- Prediction:
- The trained model is used to predict the label (SP or NSP) for each character in a given Sanskrit word
- The FastAPI server and Telegram bot provide interfaces for users to interact with the model
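Since the trained model is saved as sanskrit_model.pkl, both app.py and bot.py presumably load it with pickle before predicting; a hedged sketch of that loading step is below. The character encoding fed to the loaded model must match whatever the training script used.

```python
import pickle

# Illustrative loader for the pickled model; the file name matches the
# one produced by the training step above.
def load_model(path: str = "sanskrit_model.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)
```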
sanskrit-sandhi-prediction/
├── app.py # FastAPI API to serve the model
├── bot.py # Telegram bot script
├── train.py # Script to train the model
├── sanskrit_model.pkl # Trained model file
├── requirements.txt # Python dependencies
├── .env # Store your Telegram bot token here
└── data/
└── dataset.txt # Training dataset
- You can help refine the dataset creation; the code for that is available at https://github.com/pradyumna-7/Sandhi-Splits
- Contributions that refine the model are also welcome