A lightweight language model implementation with ChatML template support and GPU memory monitoring.

## Installation

```bash
pip install --upgrade "transformers>=4.52.0" "accelerate" torch
```

## Loading the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "sartifyllc/pawa-min-alpha"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)
```

## Features

- ChatML Template Support: Properly formatted conversation handling
- GPU Memory Monitoring: Track memory usage during inference
- Multi-language Support: Demonstrated with English-Swahili translation
- Efficient Memory Usage: Uses bfloat16 precision for optimal performance
- Automatic Device Placement: Handles GPU/CPU allocation automatically
## Model Specifications

- Model: sartifyllc/pawa-min-alpha
- Max Sequence Length: 2048 tokens
- Precision: bfloat16
- Temperature: 0.2 (adjustable)
- Top-p: 0.9 (nucleus sampling)
## Generation Configuration

```python
generation_config = {
    "max_new_tokens": 128,
    "do_sample": True,
    "temperature": 0.2,
    "top_p": 0.9,
    "pad_token_id": tokenizer.pad_token_id,
    "eos_token_id": tokenizer.eos_token_id,
    "use_cache": True
}
```

## ChatML Format

The model uses ChatML format for conversations:
```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing well, thank you!"}
]
```

Supported roles:

- `system` - System instructions
- `user` - User messages
- `assistant` - Model responses
## GPU Memory Monitoring

The implementation includes comprehensive GPU memory tracking:

```python
# Before inference
gpu_stats = torch.cuda.get_device_properties(0)
start_reserved = torch.cuda.max_memory_reserved() / 1024**3

# After inference
end_reserved = torch.cuda.max_memory_reserved() / 1024**3
delta_reserved = end_reserved - start_reserved
```

## Translation Example

```python
def translate_swahili_to_english(swahili_text):
    messages = [
        {
            "role": "user",
            "content": f"Translate to English from swahili: '{swahili_text}'"
        }
    ]
    response = generate_response(messages, max_new_tokens=128)
    return response

# Example usage
swahili_text = "Baba yangu ni shujaa na anaishi Marekani"
translation = translate_swahili_to_english(swahili_text)
print(f"Original: {swahili_text}")
print(f"Translation: {translation}")
```

## API Reference

**Chat template formatter** - Converts conversation messages into ChatML format.
Parameters:

- `messages` (list): List of message dictionaries with `role` and `content` keys

Returns:

- `str`: Formatted ChatML template string
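As a rough illustration of what such a formatter produces, here is a minimal standalone sketch of ChatML rendering. The `<|im_start|>`/`<|im_end|>` markers are the standard ChatML tokens, assumed here; the authoritative template is whatever the model's tokenizer config ships with:

```python
def to_chatml(messages, add_generation_prompt=True):
    """Minimal ChatML rendering sketch (illustrative, not the model's own code)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

print(to_chatml([{"role": "user", "content": "Hello"}]))
```

In practice, prefer `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`, which applies the template shipped with the model rather than a hand-rolled one.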
**`generate_response`** - Generates a model response for a given conversation.

Parameters:

- `messages` (list|str): Conversation messages or a single string
- `max_new_tokens` (int): Maximum tokens to generate

Returns:

- `list`: Generated response(s)
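The examples above call `generate_response` without showing its body. A plausible sketch, assuming the `model`, `tokenizer`, and `generation_config` objects created in the earlier sections are in scope (the string-to-single-user-turn normalization is an assumption based on the `list|str` signature):

```python
import torch

def as_chat_messages(messages):
    # Assumed behavior: a bare string is treated as a single user turn.
    if isinstance(messages, str):
        return [{"role": "user", "content": messages}]
    return list(messages)

def generate_response(messages, max_new_tokens=128):
    """Hypothetical sketch; relies on `model`, `tokenizer`, `generation_config` from above."""
    chat = as_chat_messages(messages)
    # Render the conversation with the model's chat template.
    input_ids = tokenizer.apply_chat_template(
        chat, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    config = {**generation_config, "max_new_tokens": max_new_tokens}
    with torch.no_grad():
        output_ids = model.generate(input_ids, **config)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[:, input_ids.shape[-1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
```

Slicing off `input_ids.shape[-1]` tokens before decoding is what lets the function return only the assistant's reply rather than the whole prompt plus completion.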
## Memory Usage Output

The script provides detailed memory usage statistics, for example:

```
🖥️ GPU: NVIDIA GeForce RTX 4090
📊 Max memory: 24.0 GB
🔹 Reserved before Inference: 2.1 GB
📈 Peak reserved memory after Inference: 3.8 GB
📉 Additional memory used for Inference: 1.7 GB
💯 Total memory used (%): 15.8 %
🧠 Inference memory usage (%): 7.1 %
```
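The percentages in that report follow directly from the reserved-memory readings; a small sketch using the sample figures above (the formulas are assumptions inferred from the printed labels):

```python
# Example figures from the sample output above, in GB.
max_memory = 24.0        # total GPU memory
start_reserved = 2.1     # reserved before inference
end_reserved = 3.8       # peak reserved after inference

delta_reserved = end_reserved - start_reserved               # ~1.7 GB for inference
total_pct = round(end_reserved / max_memory * 100, 1)        # 15.8
inference_pct = round(delta_reserved / max_memory * 100, 1)  # 7.1
print(f"Total: {total_pct} %, Inference: {inference_pct} %")
```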
## Requirements

- Python 3.8+
- PyTorch with CUDA support
- Transformers >= 4.52.0
- Accelerate library
- CUDA-compatible GPU (recommended)
## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Model Information

- Model Hub: sartifyllc/pawa-min-alpha
- Architecture: Causal Language Model
- Use Case: Multi-language text generation and translation
## Troubleshooting

- CUDA Out of Memory: Reduce `max_seq_length` or use CPU inference
- Model Loading Errors: Ensure `trust_remote_code=True` is set
- Tokenizer Issues: Verify `pad_token` is properly configured

## Performance Tips

- Use `torch.bfloat16` or `torch.float16` for reduced memory usage
- Enable `use_cache=True` for faster inference
- Consider gradient checkpointing for training scenarios
**Note:** This model is designed for research and educational purposes. Please review the model's capabilities and limitations before production use.