Enhanced SLM (Simple Language Model) 🤖

An improved small conversational language model, reworked to reduce hallucination (nonsensical or broken output) and to give more coherent responses.

🚀 Key Improvements Made

1. Subword Tokenization

Before: Character-level tokenization (every character = 1 token)
After: Byte-Pair Encoding (BPE) with a vocabulary of 8,000 subwords
Impact: Far fewer broken words in the output, and each token carries more meaning
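
To make the BPE idea concrete, here is a minimal pure-Python sketch of how merge rules are learned and applied. It is a toy illustration of the general algorithm, not the code in tokenizer.py; the function names `learn_bpe`, `merge_pair`, and `encode` are hypothetical, and a real tokenizer would also handle bytes, whitespace, and special tokens.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every word in the working vocabulary.
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges


def encode(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["hug"] * 10 + ["pug"] * 5 + ["hugs"] * 5
merges = learn_bpe(corpus, num_merges=2)
print(encode("hugs", merges))  # ['hug', 's']
```

Frequent character sequences become single tokens, so common words are no longer spelled out letter by letter.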

2. Enhanced Model Architecture

Before: Simple GRU (embedding size 128, hidden size 256)
After: 3-layer LSTM (embedding size 256, hidden size 512) with attention
Features:
- Multi-head attention mechanism
- Layer normalization
- Dropout for regularization
- Proper weight initialization
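
The core of the attention feature listed above can be illustrated with a single-head, pure-Python version of scaled dot-product attention. This is a sketch of the general technique, not the code in model/slm.py; the names `softmax` and `attention` are hypothetical, and the multi-head version would typically run several of these in parallel on learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the context vector (weighted sum of values) and the weights.
    """
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted average of the value vectors, one output dimension at a time.
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return context, weights


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
context, weights = attention(query, keys, values)
```

The position whose key best matches the query receives the largest weight, which is how the model can focus on the relevant part of the context.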

3. Better Training Strategy

Before: Basic training with fixed learning rate
After:
- AdamW optimizer with weight decay
- Cosine annealing learning rate scheduler
- Gradient clipping
- Early stopping with validation
- Train/validation split (90/10)
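
The scheduling and stopping logic in this list can be sketched without any framework. The block below is an illustration under assumptions, not the contents of train.py: the helper names (`cosine_lr`, `EarlyStopping`, `split_train_val`), the patience of 3, and the minimum learning rate of 0 are all hypothetical choices.

```python
import math


def cosine_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Learning rate at `step` under cosine annealing from base_lr down to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def split_train_val(items, val_fraction=0.1):
    """Hold out the last `val_fraction` of items for validation (90/10 by default)."""
    n_val = max(1, round(len(items) * val_fraction))
    return items[:-n_val], items[-n_val:]


train_items, val_items = split_train_val(list(range(100)))
stopper = EarlyStopping(patience=3)
decisions = [stopper.step(loss) for loss in [1.0, 0.9, 0.95, 0.96, 0.97]]
```

In a real run the learning rate from `cosine_lr` would be applied to the optimizer each step, and gradient clipping would happen just before the optimizer update.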

4. Improved Generation

Before: Greedy decoding (always picks the most likely token)
After:
- Top-k filtering (k=50)
- Top-p (nucleus) sampling (p=0.9)
- Temperature control
- Conversation memory (last 3 exchanges)
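
How temperature, top-k, and top-p combine can be shown on a plain list of logits. This is a minimal sketch of the general technique, not the code in generate.py; the function name `sample_next` and the order of the filtering steps are assumptions, and the defaults here follow the values in this list (k=50, p=0.9).

```python
import math
import random


def sample_next(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random):
    """Sample one token id from raw logits using temperature, top-k, and top-p."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring token ids, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    m = scaled[ranked[0]]
    exps = [math.exp(scaled[i] - m) for i in ranked]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(ranked, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    ids, weights = zip(*kept)
    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(ids, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [5.0, 1.0, 0.5, -2.0]
token_id = sample_next(logits, rng=rng)
```

Setting `top_k=1` reduces this to greedy decoding; raising the temperature or `top_p` admits less likely tokens and increases variety.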

5. Enhanced Dataset

Before: DailyDialog only (~13k conversations)
After: DailyDialog + OpenAssistant (much larger, more diverse)
Features: Better filtering, more varied conversation styles

📁 Project Structure

SLM/
├── data/
│   ├── brain.txt          # Training dataset
│   ├── vocab.json         # Vocabulary mapping
│   ├── encoded.txt        # Tokenized data
│   └── tokenizer.json     # Subword tokenizer
├── model/
│   ├── slm.py            # Enhanced model architecture
│   └── slm_weight.pt     # Trained weights
├── GET_Data.py           # Dataset collection
├── tokenizer.py          # Subword tokenizer creation
├── train.py              # Enhanced training script
├── generate.py           # Improved generation script
├── test_setup.py         # Setup verification
└── requirements.txt      # Dependencies

🛠️ Installation & Setup

Install dependencies:
```
pip install -r requirements.txt
```
Get enhanced dataset:
```
python GET_Data.py
```
Create subword tokenizer:
```
python tokenizer.py
```
Train the model:
```
python train.py
```
Chat with your bot:
```
python generate.py
```

🧪 Testing

Run the test suite to verify everything is working:

python test_setup.py

🎯 Key Features

Reduced Hallucination

- Subword tokenization reduces broken word generation
- Larger context window (128 tokens vs 64) lets the model see more of the conversation
- Attention mechanism helps focus on relevant context
- Better sampling strategies reduce repetitive/nonsensical outputs

Improved Coherence

- Conversation memory maintains context across exchanges
- Multi-layer architecture captures more complex patterns
- Validation-based training with early stopping helps limit overfitting
- Enhanced dataset provides better training examples

Better User Experience

- Interactive chat loop with conversation history
- Error handling for graceful failures
- Clear commands (quit, clear, etc.)
- Progress indicators during training
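
The chat loop behaviour described above (bounded history, quit/clear commands, graceful failures) could look roughly like this. It is a sketch under assumptions, not the code in generate.py: the `ChatSession` class, the "User:"/"Bot:" prompt format, and the `generate` callback are all hypothetical stand-ins for the real model call.

```python
from collections import deque


class ChatSession:
    """Keeps the last few exchanges and turns them into a prompt."""

    def __init__(self, max_exchanges=3):
        # deque(maxlen=...) silently drops the oldest exchange when full.
        self.history = deque(maxlen=max_exchanges)

    def build_prompt(self, user_message):
        """Concatenate remembered exchanges with the new user message."""
        lines = []
        for user, bot in self.history:
            lines.append(f"User: {user}")
            lines.append(f"Bot: {bot}")
        lines.append(f"User: {user_message}")
        lines.append("Bot:")
        return "\n".join(lines)

    def handle(self, user_message, generate):
        """Process one input line; returns the reply, or None to end the chat."""
        command = user_message.strip().lower()
        if command == "quit":
            return None
        if command == "clear":
            self.history.clear()
            return "History cleared."
        try:
            reply = generate(self.build_prompt(user_message))
        except Exception as exc:  # keep the loop alive on model errors
            return f"Sorry, something went wrong: {exc}"
        self.history.append((user_message, reply))
        return reply


session = ChatSession(max_exchanges=3)
fake_generate = lambda prompt: "ok"  # stand-in for the real model
for i in range(4):
    session.handle(f"msg {i}", fake_generate)
```

After four messages only the three most recent exchanges remain in `session.history`, matching the "last 3 exchanges" memory described earlier.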

🔧 Configuration

Model Parameters

# Enhanced model configuration
embed_size = 256      # Increased from 128
hidden_size = 512     # Increased from 256
num_layers = 3        # Multi-layer architecture
dropout = 0.1         # Regularization

Training Parameters

seq_length = 128      # Increased context window
batch_size = 16       # Optimized for larger model
learning_rate = 0.0001
weight_decay = 0.01   # Decoupled weight decay (AdamW)

Generation Parameters

temperature = 0.7     # Controls randomness
top_k = 40           # Top-k filtering
top_p = 0.85         # Nucleus sampling
max_new_tokens = 80  # Response length
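
One way to keep these settings together is a set of small dataclasses with basic validation. This is only a suggestion: the class names `ModelConfig`, `TrainConfig`, and `GenerationConfig` are hypothetical and do not appear in the project's scripts. The default values are copied from the configuration above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ModelConfig:
    embed_size: int = 256
    hidden_size: int = 512
    num_layers: int = 3
    dropout: float = 0.1


@dataclass(frozen=True)
class TrainConfig:
    seq_length: int = 128
    batch_size: int = 16
    learning_rate: float = 1e-4
    weight_decay: float = 0.01


@dataclass(frozen=True)
class GenerationConfig:
    temperature: float = 0.7
    top_k: int = 40
    top_p: float = 0.85
    max_new_tokens: int = 80

    def __post_init__(self):
        # Reject values that would make sampling ill-defined.
        if self.temperature <= 0:
            raise ValueError("temperature must be positive")
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in (0, 1]")


generation = GenerationConfig()
settings = asdict(ModelConfig())
```

Catching an invalid `top_p` or `temperature` at construction time is easier to debug than a failure deep inside the generation loop.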

📊 Performance Improvements

| Metric | Before | After | Improvement |
|---|---|---|---|
| Vocab Size | ~75 chars | 8000 subwords | ~107x larger |
| Context Window | 64 tokens | 128 tokens | 2x larger |
| Model Parameters | ~500K | ~2.5M | 5x larger |
| Architecture | GRU | LSTM + Attention | More sophisticated |
| Sampling | Greedy | Top-k + Top-p | Better diversity |

🚨 Troubleshooting

Common Issues

"No module named 'torch'"
- Run: pip install torch numpy tqdm datasets transformers

Tokenizer not found
- Run python tokenizer.py before training or generating

Out of memory during training
- Reduce batch_size in train.py
- If GPU memory is the limit, train on CPU instead (slower)

Poor generation quality
- Train for more epochs
- Adjust the temperature, top_k, and top_p parameters
- Check dataset quality
