A significantly improved conversational AI model designed to reduce hallucination and provide more coherent responses.
- Before: Character-level tokenization (every letter = 1 token)
- After: Byte-Pair Encoding (BPE) with 8000 vocabulary size
- Impact: The model predicts subword units instead of single letters, so it produces far fewer broken words
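To make the move from characters to subwords concrete, here is a toy sketch of how BPE learns merge rules. It is not the code in tokenizer.py; the function names (`learn_bpe_merges`, `merge_pair`, `encode`) and the tiny word list are assumptions made for illustration, and a real tokenizer would train on the full dataset up to the 8000-entry vocabulary.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Start at character level: each word is a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the most frequent pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    merges = learn_bpe_merges(["low", "low", "lower", "lowest", "new", "newer"], 2)
    print(encode("lowest", merges))
```

Frequent character sequences collapse into single tokens, while rarer endings stay split into smaller pieces, which is why a subword model rarely has to spell a common word letter by letter.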
- Before: Simple GRU with embedding/hidden sizes of 128/256
- After: Multi-layer LSTM (3 layers) with embedding/hidden sizes of 256/512, plus attention
- Features:
  - Multi-head attention mechanism
  - Layer normalization
  - Dropout for regularization
  - Proper weight initialization
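As a rough illustration of what the attention step computes, the sketch below implements scaled dot-product attention for a single query in plain Python. It is a simplified single-head version written for this README, not the layer in model/slm.py; a multi-head layer runs several of these in parallel over learned projections. The names `softmax` and `attend` are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the context vector and the attention weights.
    """
    dim = len(query)
    # Similarity of the query to each key, scaled by sqrt(dim).
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    weights = softmax(scores)
    # Context is the weighted average of the value vectors.
    context = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return context, weights


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]
    values = [[10.0, 0.0], [0.0, 10.0]]
    context, weights = attend([4.0, 0.0], keys, values)
    print(weights, context)
```

The query aligns with the first key, so most of the weight (and therefore most of the context vector) comes from the first value.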
- Before: Basic training with fixed learning rate
- After:
  - AdamW optimizer with weight decay
  - Cosine annealing learning rate scheduler
  - Gradient clipping
  - Early stopping with validation
  - Train/validation split (90/10)
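Two of the training changes above are easy to show without any framework. The sketch below computes a cosine-annealed learning rate and tracks early stopping on validation loss. It is an illustration only, not the logic in train.py: `cosine_lr`, `EarlyStopping`, `min_lr`, and the `patience` value are assumed names and settings, and the base rate of 0.0001 is taken from the configuration in this README.

```python
import math


def cosine_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Learning rate at `step`, annealed from base_lr to min_lr along a cosine."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3):  # patience is a hypothetical setting
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            # New best: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopping(patience=3)
    fake_val_losses = [1.0, 0.9, 0.95, 0.97, 0.99]  # made-up numbers
    for epoch, loss in enumerate(fake_val_losses):
        print(epoch, cosine_lr(epoch, len(fake_val_losses)), loss)
        if stopper.should_stop(loss):
            print("stopping early at epoch", epoch)
            break
```

The rate starts at the base value, falls slowly at first, then faster, and flattens near the end, while the stopper halts training once the validation loss stops improving.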
- Before: Basic greedy decoding (always picks the most likely token)
- After:
  - Top-k filtering (k=50)
  - Top-p (nucleus) sampling (p=0.9)
  - Temperature control
  - Conversation memory (last 3 exchanges)
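The sampling settings above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the code in generate.py: the function name `sample_next_token` is hypothetical, the default temperature of 1.0 is a placeholder, and real implementations differ on details such as whether probabilities are renormalized between the top-k and top-p steps.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.9, rng=random):
    """Pick a token id from raw logits using temperature, top-k, and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely token ids.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break

    # Sample among the survivors; random.choices renormalizes the weights.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.5, 0.3, -1.0, -3.0]  # made-up scores for 5 tokens
    rng = random.Random(0)
    print([sample_next_token(toy_logits, temperature=0.7, rng=rng) for _ in range(10)])
```

Setting `top_k=1` reduces this to greedy decoding, which is a quick way to compare the old and new behavior on the same prompt.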
- Before: DailyDialog only (~13k conversations)
- After: DailyDialog + OpenAssistant (much larger, more diverse)
- Features: Better filtering, more varied conversation styles
SLM/
├── data/
│ ├── brain.txt # Training dataset
│ ├── vocab.json # Vocabulary mapping
│ ├── encoded.txt # Tokenized data
│ └── tokenizer.json # Subword tokenizer
├── model/
│ ├── slm.py # Enhanced model architecture
│ └── slm_weight.pt # Trained weights
├── GET_Data.py # Dataset collection
├── tokenizer.py # Subword tokenizer creation
├── train.py # Enhanced training script
├── generate.py # Improved generation script
├── test_setup.py # Setup verification
└── requirements.txt # Dependencies
- Install dependencies:
  pip install -r requirements.txt
- Get enhanced dataset:
  python GET_Data.py
- Create subword tokenizer:
  python tokenizer.py
- Train the model:
  python train.py
- Chat with your bot:
  python generate.py
Run the test suite to verify everything is working:
python test_setup.py
- Subword tokenization prevents broken word generation
- Larger context window (128 tokens vs 64) for better memory
- Attention mechanism helps focus on relevant context
- Better sampling strategies reduce repetitive/nonsensical outputs
- Conversation memory maintains context across exchanges
- Multi-layer architecture captures more complex patterns
- Validation-based training prevents overfitting
- Enhanced dataset provides better training examples
- Interactive chat loop with conversation history
- Error handling for graceful failures
- Clear commands (quit, clear, etc.)
- Progress indicators during training
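The conversation memory and the clear command mentioned above could be modeled roughly like this. It is a sketch, not the code in generate.py: the class name `ConversationMemory` and the "User:" / "Bot:" prompt format are assumptions, while the limit of three exchanges comes from this README.

```python
from collections import deque


class ConversationMemory:
    """Keep only the most recent exchanges and build a prompt from them."""

    def __init__(self, max_exchanges=3):
        # A deque with maxlen drops the oldest exchange automatically.
        self.exchanges = deque(maxlen=max_exchanges)

    def add(self, user_text, bot_text):
        self.exchanges.append((user_text, bot_text))

    def clear(self):
        """Forget everything, as a 'clear' command would."""
        self.exchanges.clear()

    def build_prompt(self, user_text):
        # Hypothetical prompt layout: alternating User/Bot lines.
        lines = []
        for past_user, past_bot in self.exchanges:
            lines.append(f"User: {past_user}")
            lines.append(f"Bot: {past_bot}")
        lines.append(f"User: {user_text}")
        lines.append("Bot:")
        return "\n".join(lines)


if __name__ == "__main__":
    memory = ConversationMemory()
    for i in range(4):
        memory.add(f"question {i}", f"answer {i}")
    print(memory.build_prompt("what did I ask first?"))
```

After four exchanges only the last three remain, so the prompt stays short enough to fit the 128-token context window more often.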
# Enhanced model configuration
embed_size = 256 # Increased from 128
hidden_size = 512 # Increased from 256
num_layers = 3 # Multi-layer architecture
dropout = 0.1 # Regularization

# Training configuration
seq_length = 128 # Increased context window
batch_size = 16 # Optimized for larger model
learning_rate = 0.0001
weight_decay = 0.01 # Decoupled weight decay (AdamW)

# Generation configuration
temperature = 0.7 # Controls randomness
top_k = 40 # Top-k filtering
top_p = 0.85 # Nucleus sampling
max_new_tokens = 80 # Response length

| Metric | Before | After | Improvement |
|---|---|---|---|
| Vocab Size | ~75 chars | 8000 subwords | ~107x larger |
| Context Window | 64 tokens | 128 tokens | 2x larger |
| Model Parameters | ~500K | ~2.5M | 5x larger |
| Architecture | GRU | LSTM + Attention | More sophisticated |
| Sampling | Greedy | Top-k + Top-p | Better diversity |
- "No module named 'torch'"
  - Run: pip install torch numpy tqdm datasets transformers
- Tokenizer not found
  - Run: python tokenizer.py first
- Out of memory during training
  - Reduce batch_size in train.py
  - Use CPU instead of GPU
- Poor generation quality
  - Train for more epochs
  - Adjust temperature/top-k/top-p parameters
  - Check dataset quality
Feel free to submit issues and enhancement requests!
This project is open source and available under the MIT License.