git lfs install
git clone https://github.com/eleldar/Punctuation.git
cd Punctuation
python3 -m venv venv
. venv/bin/activate
pip install -r requirements.txt
cd models
git clone https://huggingface.co/eleldar/rubert-base-cased-sentence
git clone https://huggingface.co/eleldar/repunct-model_ft repunct-model_ft/weights/
python main.py
Open http://127.0.0.1:8000/docs in a browser.
Raw text must be tokenized before it is passed to the model. The library handles this with BaseDataset.parse_tokens.
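To make the task framing concrete, here is a minimal standard-library sketch of how punctuated text can be turned into (word, label) pairs, where each label names the mark that follows the word. This is not the implementation of BaseDataset.parse_tokens. The function name `words_with_labels` and the label names are hypothetical, chosen only for illustration, and the regex ignores hyphenated words and other marks.

```python
import re

# Hypothetical label names, chosen for illustration only.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}


def words_with_labels(text):
    """Split text into (word, label) pairs.

    The label names the punctuation mark that directly follows the word,
    or "O" when no supported mark follows it.
    """
    pairs = []
    # A run of word characters, optionally followed by one supported mark.
    for match in re.finditer(r"(\w+)([,.?])?", text):
        word, mark = match.group(1), match.group(2)
        pairs.append((word.lower(), PUNCT_LABELS.get(mark, "O")))
    return pairs


pairs = words_with_labels("Привет, как дела? Хорошо.")
words = [word for word, _ in pairs]
labels = [label for _, label in pairs]
```

At inference time only the lowercased words would be available, and the labels are what the model has to predict.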
The model architecture is straightforward:
- BERT layer - DeepPavlov/rubert-base-cased-sentence language model
- Bi-LSTM layer - reduces the dimensionality of the BERT output
- Linear layer - final layer that predicts which symbol should follow each token
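
The three layers above can be sketched in PyTorch as follows. This is an illustrative sketch, not the repository's model code. To stay self-contained it uses an `nn.Embedding` as a stand-in for the BERT layer (a real setup would load the pretrained language model instead). The class name `PunctuationTagger`, the LSTM size, the vocabulary size, and the number of labels are all assumptions.

```python
import torch
from torch import nn


class PunctuationTagger(nn.Module):
    """Encoder -> Bi-LSTM -> Linear, producing one label score vector per token."""

    def __init__(self, vocab_size=1000, hidden_size=768, lstm_size=128, num_labels=4):
        super().__init__()
        # Stand-in for the BERT layer (assumption): maps token ids to
        # 768-dimensional vectors, the hidden size of a BERT-base model.
        self.encoder = nn.Embedding(vocab_size, hidden_size)
        # The Bi-LSTM maps 768 features down to 2 * lstm_size features.
        self.lstm = nn.LSTM(hidden_size, lstm_size, batch_first=True, bidirectional=True)
        # Final layer: one score per punctuation label for every token.
        self.classifier = nn.Linear(2 * lstm_size, num_labels)

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)   # (batch, seq_len, hidden_size)
        lstm_out, _ = self.lstm(hidden)    # (batch, seq_len, 2 * lstm_size)
        return self.classifier(lstm_out)   # (batch, seq_len, num_labels)


model = PunctuationTagger()
token_ids = torch.randint(0, 1000, (2, 16))  # 2 sequences of 16 token ids
logits = model(token_ids)
predicted = logits.argmax(dim=-1)            # predicted label index per token
```

Taking the argmax over the last dimension gives, for each token, the index of the symbol predicted to follow it.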
This repository contains code from xashru/punctuation-restoration, edited for production use.