This repository provides a character-level classification model for Sanskrit text: it predicts whether each character in a word is part of a Sandhi point (SP) or not (NSP). The project includes a preprocessing pipeline, a Multi-Layer Perceptron (MLP) training script, a FastAPI server for model interaction, and a Telegram bot for user-friendly predictions. These labels can help reconstruct the original words before Sandhi formation, improve Sanskrit text segmentation, and aid downstream NLP tasks. The model is not yet highly accurate; we are working to improve it and are exploring alternatives such as different tokenizers and RNN-based architectures.
- Character-level classification: Predicts if each character in a Sanskrit word is part of a Sandhi point.
- Preprocessing pipeline: Tokenizes and pads Sanskrit words for input to the model.
- API integration: Provides a FastAPI server for model interaction.
- Telegram bot: Allows users to interact with the model easily by sending Sanskrit words.
- Python 3.x
- Telegram bot token (Create a bot on Telegram via BotFather)
- Install dependencies via pip (detailed below).
- Clone the repository:
git clone https://github.com/your-username/sanskrit-sandhi-prediction.git
cd sanskrit-sandhi-prediction
- Set up a virtual environment (optional but recommended):
python -m venv venv
source venv/bin/activate   # For Linux/Mac
venv\Scripts\activate      # For Windows
- Install dependencies:
pip install -r requirements.txt
- Set up the environment:
- Create a .env file in the root directory of the project
- Add your Telegram bot token in the .env file:
TELEGRAM_BOT_TOKEN=your_bot_token_here
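As a sketch of how the token can be read at runtime, the helper below (not part of the repo; the function name is illustrative) pulls `TELEGRAM_BOT_TOKEN` from the environment. A library such as python-dotenv can load the `.env` file into `os.environ` before it is called.

```python
import os

# Illustrative helper: read the bot token that the .env entry above defines.
# Assumes something (e.g. python-dotenv's load_dotenv) has already loaded
# the .env file into the process environment.
def get_bot_token() -> str:
    token = os.getenv("TELEGRAM_BOT_TOKEN")
    if not token:
        raise RuntimeError("TELEGRAM_BOT_TOKEN is not set; check your .env file")
    return token
```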
- Train the model:
  - Run the following command to train the model:
    python SansSandhi.py
  - This will process the dataset, train the Multi-Layer Perceptron (MLP) model, and save it as sanskrit_model.pkl
- Start the FastAPI server:
  - Start the FastAPI API to serve predictions:
    uvicorn app:app --reload
  - The API will be available at http://127.0.0.1:8000
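Once the server is up, predictions can be requested over HTTP. The snippet below builds a request URL client-side; the `/predict` route and the `word` query parameter are assumptions for illustration — check app.py for the actual path and request shape.

```python
from urllib import parse

# Hypothetical client-side URL builder for the prediction endpoint.
# The "/predict" route and "word" parameter are assumed, not confirmed.
def build_predict_url(word: str, base_url: str = "http://127.0.0.1:8000") -> str:
    return f"{base_url}/predict?{parse.urlencode({'word': word})}"
```

The resulting URL can then be fetched with any HTTP client (curl, requests, urllib) while the uvicorn server is running.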
- Start the bot:
  - Run the bot script:
    python bot.py
  - The bot will be active and respond to messages sent to it
- Interacting with the bot:
  - Send a Sanskrit word to the bot
  - The bot will split the word into characters and predict whether each character is part of a Sandhi point (SP) or not (NSP)
Telegram Bot:
- Send a word like यॊयस्माज्जायते to the bot
- The bot will return the predictions for each character: either SP or NSP
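A minimal sketch of how such a per-character reply could be assembled is shown below; the actual formatting in bot.py may differ. Note that iterating a Python string yields code points, so Devanagari vowel signs and virāmas each count as a separate "character" here.

```python
# Illustrative sketch: pair each character of the word with its predicted
# SP/NSP label to build the bot's reply text.
def format_predictions(word: str, labels: list) -> str:
    assert len(word) == len(labels)  # one label per character
    return "\n".join(f"{ch}: {lab}" for ch, lab in zip(word, labels))
```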
- Data Preprocessing:
- The dataset is preprocessed to tokenize Sanskrit words and split them into sequences of characters
- The tokenizer is trained on a predefined set of characters and used to convert words into sequences of indices
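The two preprocessing steps above can be sketched in pure Python as follows. The vocabulary construction, padding scheme, and index 0 reserved for padding are assumptions for illustration; the repository's tokenizer is trained on its own predefined character set.

```python
PAD_ID = 0  # assumed: index 0 reserved for padding

# Build a character-to-index vocabulary from a list of words.
def build_vocab(words):
    chars = sorted({ch for w in words for ch in w})
    return {ch: i + 1 for i, ch in enumerate(chars)}

# Convert a word to a fixed-length sequence of indices, truncating or
# right-padding with PAD_ID as needed.
def encode(word, vocab, max_len):
    ids = [vocab.get(ch, PAD_ID) for ch in word][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))
```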
- Model Training:
- A Multi-Layer Perceptron (MLP) model is built to classify each character in a word as part of a Sandhi point (SP) or not (NSP)
- The model is trained on tokenized and padded sequences of words
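To make the per-character MLP classification concrete, here is a minimal forward pass for a single character position: one hidden layer with ReLU, then a sigmoid output thresholded into SP/NSP. The shapes and weights are illustrative only; real weights come from training in SansSandhi.py.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

# One dense layer: W is a list of weight rows, b the bias vector.
def dense(x, W, b):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

# Sketch of an MLP deciding SP vs NSP for one encoded character position.
def mlp_predict(x, W1, b1, W2, b2):
    h = relu(dense(x, W1, b1))
    logit = dense(h, W2, b2)[0]
    p_sp = 1.0 / (1.0 + math.exp(-logit))  # sigmoid -> probability of SP
    return "SP" if p_sp >= 0.5 else "NSP"
```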
- Prediction:
- The trained model is used to predict the label (SP or NSP) for each character in a given Sanskrit word
- The FastAPI server and Telegram bot provide interfaces for users to interact with the model
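Since the trained model is saved as sanskrit_model.pkl, both app.py and bot.py presumably load it with pickle before predicting; a hedged sketch of that loading step is below. The character encoding fed to the loaded model must match whatever the training script used.

```python
import pickle

# Illustrative loader for the pickled model; the file name matches the
# one produced by the training step above.
def load_model(path: str = "sanskrit_model.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)
```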
sanskrit-sandhi-prediction/
├── app.py # FastAPI API to serve the model
├── bot.py # Telegram bot script
├── train.py # Script to train the model
├── sanskrit_model.pkl # Trained model file
├── requirements.txt # Python dependencies
├── .env # Store your Telegram bot token here
└── data/
└── dataset.txt # Training dataset
- You can help refine the dataset creation; the code for that is available at https://github.com/pradyumna-7/Sandhi-Splits
- Contributions that refine the model are also welcome