A lightweight language model implementation with ChatML template support and GPU memory monitoring.

## Installation

```bash
pip install --upgrade "transformers>=4.52.0" "accelerate" torch
```

## Loading the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "sartifyllc/pawa-min-alpha"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)
```

## Features

- ChatML Template Support: Properly formatted conversation handling
- GPU Memory Monitoring: Track memory usage during inference
- Multi-language Support: Demonstrated with English-Swahili translation
- Efficient Memory Usage: Uses bfloat16 precision for optimal performance
- Automatic Device Placement: Handles GPU/CPU allocation automatically
## Model Specifications

- Model: sartifyllc/pawa-min-alpha
- Max Sequence Length: 2048 tokens
- Precision: bfloat16
- Temperature: 0.2 (adjustable)
- Top-p: 0.9 (nucleus sampling)
## Generation Configuration

```python
generation_config = {
    "max_new_tokens": 128,
    "do_sample": True,
    "temperature": 0.2,
    "top_p": 0.9,
    "pad_token_id": tokenizer.pad_token_id,
    "eos_token_id": tokenizer.eos_token_id,
    "use_cache": True
}
```

## ChatML Format

The model uses ChatML format for conversations:
```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing well, thank you!"}
]
```

Supported roles:

- `system` - System instructions
- `user` - User messages
- `assistant` - Model responses
## GPU Memory Monitoring

The implementation includes comprehensive GPU memory tracking:

```python
# Before inference
gpu_stats = torch.cuda.get_device_properties(0)
start_reserved = torch.cuda.max_memory_reserved() / 1024**3

# After inference
end_reserved = torch.cuda.max_memory_reserved() / 1024**3
delta_reserved = end_reserved - start_reserved
```

## Translation Example

```python
def translate_swahili_to_english(swahili_text):
    messages = [
        {
            "role": "user",
            "content": f"Translate to English from swahili: '{swahili_text}'"
        }
    ]
    response = generate_response(messages, max_new_tokens=128)
    return response

# Example usage
swahili_text = "Baba yangu ni shujaa na anaishi Marekani"
translation = translate_swahili_to_english(swahili_text)
print(f"Original: {swahili_text}")
print(f"Translation: {translation}")
```

## API Reference

**Chat template formatter** - Converts conversation messages into ChatML format.
Parameters:

- `messages` (list): List of message dictionaries with `role` and `content` keys

Returns:

- `str`: Formatted ChatML template string
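As a rough illustration of what such a formatter produces, here is a minimal standalone sketch of ChatML rendering. The `<|im_start|>`/`<|im_end|>` markers are the standard ChatML tokens, assumed here; the authoritative template is whatever the model's tokenizer config ships with:

```python
def to_chatml(messages, add_generation_prompt=True):
    """Minimal ChatML rendering sketch (illustrative, not the model's own code)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

print(to_chatml([{"role": "user", "content": "Hello"}]))
```

In practice, prefer `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`, which applies the template shipped with the model rather than a hand-rolled one.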
**`generate_response`** - Generates a model response for a given conversation.

Parameters:

- `messages` (list|str): Conversation messages or a single string
- `max_new_tokens` (int): Maximum tokens to generate

Returns:

- `list`: Generated response(s)
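The examples above call `generate_response` without showing its body. A plausible sketch, assuming the `model`, `tokenizer`, and `generation_config` objects created in the earlier sections are in scope (the string-to-single-user-turn normalization is an assumption based on the `list|str` signature):

```python
import torch

def as_chat_messages(messages):
    # Assumed behavior: a bare string is treated as a single user turn.
    if isinstance(messages, str):
        return [{"role": "user", "content": messages}]
    return list(messages)

def generate_response(messages, max_new_tokens=128):
    """Hypothetical sketch; relies on `model`, `tokenizer`, `generation_config` from above."""
    chat = as_chat_messages(messages)
    # Render the conversation with the model's chat template.
    input_ids = tokenizer.apply_chat_template(
        chat, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    config = {**generation_config, "max_new_tokens": max_new_tokens}
    with torch.no_grad():
        output_ids = model.generate(input_ids, **config)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[:, input_ids.shape[-1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
```

Slicing off `input_ids.shape[-1]` tokens before decoding is what lets the function return only the assistant's reply rather than the whole prompt plus completion.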
## Memory Usage Output

The script provides detailed memory usage statistics, for example:

```
🖥️ GPU: NVIDIA GeForce RTX 4090
📊 Max memory: 24.0 GB
🔹 Reserved before Inference: 2.1 GB
📈 Peak reserved memory after Inference: 3.8 GB
📉 Additional memory used for Inference: 1.7 GB
💯 Total memory used (%): 15.8 %
🧠 Inference memory usage (%): 7.1 %
```
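The percentages in that report follow directly from the reserved-memory readings; a small sketch using the sample figures above (the formulas are assumptions inferred from the printed labels):

```python
# Example figures from the sample output above, in GB.
max_memory = 24.0        # total GPU memory
start_reserved = 2.1     # reserved before inference
end_reserved = 3.8       # peak reserved after inference

delta_reserved = end_reserved - start_reserved               # ~1.7 GB for inference
total_pct = round(end_reserved / max_memory * 100, 1)        # 15.8
inference_pct = round(delta_reserved / max_memory * 100, 1)  # 7.1
print(f"Total: {total_pct} %, Inference: {inference_pct} %")
```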
## Requirements

- Python 3.8+
- PyTorch with CUDA support
- Transformers >= 4.52.0
- Accelerate library
- CUDA-compatible GPU (recommended)
## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Model Information

- Model Hub: sartifyllc/pawa-min-alpha
- Architecture: Causal Language Model
- Use Case: Multi-language text generation and translation
## Troubleshooting

- CUDA Out of Memory: Reduce `max_seq_length` or use CPU inference
- Model Loading Errors: Ensure `trust_remote_code=True` is set
- Tokenizer Issues: Verify `pad_token` is properly configured

## Performance Tips

- Use `torch.bfloat16` or `torch.float16` for reduced memory usage
- Enable `use_cache=True` for faster inference
- Consider gradient checkpointing for training scenarios
**Note:** This model is designed for research and educational purposes. Please review the model's capabilities and limitations before production use.