This project trains a transformer-based English ↔ Gujarati translation model from scratch using a custom-trained Byte-Level BPE tokenizer and Hugging Face's EncoderDecoderModel.
- Python 🐍
- HuggingFace Transformers 🤗
- Tokenizers (Byte-Level BPE)
- PyTorch (GPU with Mixed Precision)
- uv for clean dependency management
- Logging + TQDM for progress tracking
```bash
uv venv --python python3.11
uv pip install -e .
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

```text
eng-guj-translator/
├── data/
│   └── opus.gu-en.tsv
├── models/
│   └── v1/
├── logs/
│   └── train.log
├── src/
│   └── translator/
│       ├── train_model.py
│       ├── translate.py
│       └── version.py
├── pyproject.toml
└── README.md
```

```mermaid
flowchart TD
A[TSV Dataset<br>English-Gujarati] --> B[Custom BPE Tokenizer]
B --> C[PreTrainedTokenizerFast]
C --> D[EncoderDecoderModel<br> BERT2BERT]
D --> E["Trainer<br>(HuggingFace)"]
E --> F[Trained Model + Tokenizer Saved]
```
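
Conceptually, the model at the end of this pipeline can be assembled as below. This is a minimal sketch assuming randomly initialized BERT configs (training from scratch, per the overview); every config value besides the 32k vocabulary size is illustrative.

```python
from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# Randomly initialized BERT encoder and decoder sharing the custom 32k vocab.
# Hidden sizes / layer counts fall back to BERT defaults here (an assumption).
encoder_config = BertConfig(vocab_size=32000)
decoder_config = BertConfig(vocab_size=32000, is_decoder=True, add_cross_attention=True)

config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder_config, decoder_config)
model = EncoderDecoderModel(config=config)
```
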
```mermaid
flowchart TD
RawText -->|split into| BPE[Byte-Level BPE Tokens]
BPE --> TokenIDs[Assigned Token IDs]
TokenIDs --> TokenizerJSON[tokenizer.json saved]
TokenizerJSON --> HFTokenizer[PreTrainedTokenizerFast]
```
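
A rough sketch of this tokenizer flow, assuming the tokenizers library's `ByteLevelBPETokenizer` is trained directly on the raw TSV (which yields a joint English+Gujarati vocabulary); the special tokens and file paths are assumptions, not necessarily the project's exact choices:

```python
from tokenizers import ByteLevelBPETokenizer
from transformers import PreTrainedTokenizerFast

# Train a Byte-Level BPE vocabulary of 32,000 tokens on the parallel corpus.
bpe = ByteLevelBPETokenizer()
bpe.train(
    files=["data/opus.gu-en.tsv"],
    vocab_size=32000,
    special_tokens=["<pad>", "<s>", "</s>", "<unk>"],
)
bpe.save("tokenizer.json")

# Wrap the saved tokenizer.json so it plugs into any Hugging Face model.
hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",
    pad_token="<pad>",
    bos_token="<s>",
    eos_token="</s>",
    unk_token="<unk>",
)
```
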
```bash
python -m translator.train_model
```

Training runs on GPU with the following setup (sketched in code after this list):
- 🧠 Mixed Precision (fp16)
- 📈 TQDM + custom ETA logging
- 🔁 3 Epochs
- 🧱 BERT as encoder + decoder
- 🌟 Vocabulary size: 32,000
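
The settings above map onto `TrainingArguments` roughly as follows; `model` is the `EncoderDecoderModel` from the earlier sketch, `train_dataset` is assumed to be a tokenized dataset with `input_ids`, `attention_mask`, and `labels`, and the batch size is illustrative:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="models/v1",
    num_train_epochs=3,                # 3 epochs, as listed above
    fp16=True,                         # mixed precision on GPU
    per_device_train_batch_size=16,    # illustrative, not the project's value
    logging_dir="logs",
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
trainer.save_model("models/v1")        # persist the trained weights
```
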
```bash
python -m translator.translate
```

```text
Input: "How are you?"
Output: "તમે કેમ છો?"
```
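
Under the hood, inference with a saved `EncoderDecoderModel` looks roughly like this (a sketch assuming the model and tokenizer were saved to `models/v1` with `save_pretrained`; `max_length` is illustrative):

```python
from transformers import EncoderDecoderModel, PreTrainedTokenizerFast

model = EncoderDecoderModel.from_pretrained("models/v1")
tokenizer = PreTrainedTokenizerFast.from_pretrained("models/v1")

# Encode the English source, generate Gujarati token IDs, decode back to text.
inputs = tokenizer("How are you?", return_tensors="pt")
output_ids = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
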
- `pad_token_id`, `bos_token_id`, and `vocab_size` are set explicitly (see the sketch below)
- `fp16=True` enables mixed precision
- `Trainer` handles automatic GPU/CPU usage
- `tokenizer.json` is reusable across models
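
One common way to wire these IDs up (a sketch; `hf_tokenizer` is the `PreTrainedTokenizerFast` from the tokenizer sketch above, and the `decoder_start_token_id`/`eos_token_id` lines are assumptions needed for `generate()` rather than items from the list above):

```python
# EncoderDecoderModel does not infer these from the tokenizer, so set them explicitly.
model.config.pad_token_id = hf_tokenizer.pad_token_id
model.config.bos_token_id = hf_tokenizer.bos_token_id
model.config.decoder_start_token_id = hf_tokenizer.bos_token_id  # where decoding begins
model.config.eos_token_id = hf_tokenizer.eos_token_id            # where decoding stops
model.config.vocab_size = model.config.encoder.vocab_size
```
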
- Logs go to `logs/train.log`
- ETA + memory usage logged
- Supports tqdm progress bars during tokenization (see the sketch below)
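
A minimal sketch of that setup (the log format is illustrative, and the real script additionally computes ETA and memory usage):

```python
import logging
from tqdm import tqdm

logging.basicConfig(
    filename="logs/train.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

# tqdm wraps any iterable with a progress bar, e.g. while tokenizing the corpus.
with open("data/opus.gu-en.tsv", encoding="utf-8") as f:
    for line in tqdm(f, desc="tokenizing"):
        pass  # tokenize each sentence pair here
```
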
- Add BLEU score evaluation (see the sketch after this list)
- Build a Gradio interface for GUI-based translation
- Automate model versioning (`v1`, `v2`, etc.)
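
For the BLEU item, a hypothetical starting point using sacrebleu (not yet part of the project; the sentences are placeholders):

```python
import sacrebleu

hypotheses = ["તમે કેમ છો?"]    # model outputs, one per source sentence
references = [["તમે કેમ છો?"]]  # one reference stream, aligned with hypotheses
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```
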
Author: Divyang, a Solution Architect working in Cloud, AI/ML & Semiconductors

License: MIT