Fine-tune Mistral 7B for function calling using LoRA/QLoRA on Apple Silicon with the xLAM dataset.
This project enables you to:
- Fine-tune Mistral 7B Instruct on function calling tasks
- Use the xLAM dataset (Salesforce's 60K function calling examples)
- Apply LoRA (Low-Rank Adaptation) for efficient fine-tuning
- Use QLoRA (4-bit quantization) for memory-constrained devices
- Run everything locally on Apple Silicon (M1/M2/M3/M4)
Tested Configuration:
- Apple M3 with 32GB unified memory
- macOS Sonoma or later
Recommended:
- M2 Ultra (64GB+): Best performance
- M3 Max (36GB+): Very good
- M1 Max (32GB): Works well
Minimum:
- M1 with 16GB (use QLoRA with batch size 1)
- Apple Silicon Mac with 16GB+ RAM
- Python 3.9+
- Hugging Face account and access token (create one at https://huggingface.co/settings/tokens)
```bash
# Install dependencies
pip install -r requirements.txt

# Configure Hugging Face token
cp .env.example .env
# Edit .env and add: HF_TOKEN=your_token_here

# Run the full pipeline
./run_pipeline.sh
```

This will:
- Download and prepare 1000 xLAM samples (800 train, 100 valid, 100 test)
- Convert Mistral 7B to MLX format with 4-bit quantization
- Train with LoRA for 600 iterations
- Run test suite with example prompts
Expected time: ~30-40 minutes on M3 32GB
```bash
python prepare_xlam_data.py \
  --train-samples 800 \
  --valid-samples 100 \
  --test-samples 100
```

Creates `data/xlam/` with train/valid/test splits.
```bash
python convert.py \
  --hf-path mistralai/Mistral-7B-Instruct-v0.2 \
  --mlx-path mlx_model \
  -q
```

The `-q` flag enables 4-bit quantization (QLoRA), reducing memory from ~14GB to ~4GB.
```bash
python train_function_calling.py \
  --model mlx_model \
  --data data/xlam \
  --train \
  --batch-size 2 \
  --iters 600 \
  --learning-rate 1e-5 \
  --lora-layers 16
```

For M1 with 16GB:
```bash
python train_function_calling.py \
  --model mlx_model \
  --data data/xlam \
  --train \
  --batch-size 1 \
  --lora-layers 8 \
  --iters 400
```

```bash
# Run automated test suite
python test_function_calling.py \
  --model mlx_model \
  --adapter adapters.npz

# Interactive mode
python test_function_calling.py \
  --model mlx_model \
  --adapter adapters.npz \
  --interactive

# Test with specific prompt
python test_function_calling.py \
  --model mlx_model \
  --adapter adapters.npz \
  --prompt '<user>What is the weather in Tokyo?</user>\n\n<tools>'
```

The xLAM dataset uses this format:
```
<user>[user query]</user>
<tools>[available tool definitions]</tools>
<calls>[expected function calls]</calls>
```

Example:

```
<user>Check if the numbers 8 and 1233 are powers of two.</user>
<tools>{'name': 'is_power_of_two', 'description': 'Checks if a number is a power of two.', 'parameters': {'num': {'description': 'The number to check.', 'type': 'int'}}}</tools>
<calls>{'name': 'is_power_of_two', 'arguments': {'num': 8}}
{'name': 'is_power_of_two', 'arguments': {'num': 1233}}</calls>
```
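A sketch of how one raw xLAM record could be rendered into this template. The field names `query`, `tools`, and `answers` follow the Hugging Face release of the dataset; treat them as assumptions if your copy differs (the snippet emits standard JSON, whereas the example above shows Python-dict-style quoting):

```python
import json

def format_example(record: dict) -> str:
    """Render one xLAM record into the <user>/<tools>/<calls> template.

    Assumes the record carries `query`, `tools`, and `answers` fields,
    with the latter two JSON-encoded as in the Hugging Face release.
    """
    tools = json.loads(record["tools"])
    calls = json.loads(record["answers"])
    tool_lines = "\n".join(json.dumps(t) for t in tools)
    call_lines = "\n".join(json.dumps(c) for c in calls)
    return (
        f"<user>{record['query']}</user>\n\n"
        f"<tools>{tool_lines}</tools>\n\n"
        f"<calls>{call_lines}</calls>"
    )

# The example record from above, in raw-dataset shape
example = {
    "query": "Check if the numbers 8 and 1233 are powers of two.",
    "tools": json.dumps([{
        "name": "is_power_of_two",
        "description": "Checks if a number is a power of two.",
        "parameters": {"num": {"description": "The number to check.",
                               "type": "int"}},
    }]),
    "answers": json.dumps([
        {"name": "is_power_of_two", "arguments": {"num": 8}},
        {"name": "is_power_of_two", "arguments": {"num": 1233}},
    ]),
}
print(format_example(example))
```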
- LoRA Rank: 8 (dimension of the low-rank update matrices)
- LoRA Layers: 16 (fine-tune last 16 transformer layers)
- Target Modules: Q and V projections in attention layers
- Trainable Parameters: ~1-2% of total model parameters
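The LoRA update can be sketched in NumPy: a frozen weight `W` plus a scaled low-rank correction `B @ A`, where only `A` and `B` train. The dimensions and the `alpha` scale here are illustrative, not the project's actual values; with a 1024-wide toy projection and rank 8, the trainable share works out to ~1.6% per adapted matrix, in line with the ~1-2% figure above:

```python
import numpy as np

# Toy LoRA forward pass: y = W x + (alpha / r) * B (A x)
d, r, alpha = 1024, 8, 16            # alpha is illustrative only
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen pretrained weight (e.g. a Q projection)
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init so W' == W at start

x = rng.normal(size=d)
y = W @ x + (alpha / r) * (B @ (A @ x))

assert np.allclose(y, W @ x)         # B is zero, so output matches the base model initially

# Only A and B receive gradients
trainable = A.size + B.size
print(f"trainable params per layer: {trainable} of {W.size} "
      f"({100 * trainable / W.size:.2f}%)")
```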
| Parameter | Default | Description |
|---|---|---|
| `--batch-size` | 2 | Training batch size |
| `--iters` | 600 | Number of training iterations |
| `--learning-rate` | 1e-5 | Adam learning rate |
| `--lora-layers` | 16 | Number of layers to fine-tune |
| `--lora-rank` | 8 | LoRA rank parameter |
| `--steps-per-report` | 10 | Log training loss every N steps |
| `--steps-per-eval` | 100 | Evaluate on validation set every N steps |
| `--save-every` | 100 | Save checkpoint every N iterations |
On M3 32GB with batch size 2:
- ~150-200 tokens/second
- ~600 iterations in 20-30 minutes
Training progress indicators:
```
Iter 1:   Val loss 1.512  ← Baseline
Iter 100: Val loss 1.200  ← Should drop ~15-25%
Iter 200: Val loss 1.050  ← Continuing to drop
Iter 300: Val loss 0.950  ← Good progress
```
Signs to stop early:
- Validation loss plateaued or increasing (overfitting)
- Model gives reasonable responses in tests
- Training loss far below 0.5 (the model is likely memorizing the training set)
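The plateau criterion above can be turned into a small heuristic. This helper is illustrative and not part of the training scripts; the `patience` and `min_delta` thresholds are assumptions to tune for your run:

```python
def should_stop(val_losses, patience=3, min_delta=0.005):
    """Stop when validation loss hasn't improved by at least
    `min_delta` over the last `patience` evaluations."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])   # best loss before the recent window
    recent_best = min(val_losses[-patience:])   # best loss within the recent window
    return recent_best > best_before - min_delta

# Mirrors the progression above: steady drops, then a plateau
history = [1.512, 1.200, 1.050, 0.950, 0.948, 0.951, 0.950]
print(should_stop(history))  # plateau detected → True
```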
QLoRA (4-bit):
- Base model: ~4GB
- Training peak: ~12-16GB
- Recommended for: 16GB+ systems
Full precision (fp16):
- Base model: ~14GB
- Training peak: ~22-28GB
- Recommended for: 32GB+ systems
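The base-model figures follow from simple arithmetic on Mistral 7B's roughly 7.2B parameters; the 4.5 bits per parameter used for the quantized case (4-bit weights plus quantization scales) is an assumption, and activations, optimizer state, and KV caches account for the higher training peaks:

```python
# Back-of-envelope weight memory for a ~7.2B-parameter model.
# These are lower bounds on memory, not the full training peak.
PARAMS = 7.2e9

def weight_gb(bits_per_param: float) -> float:
    """Weight storage in GiB at a given precision."""
    return PARAMS * bits_per_param / 8 / 1024**3

fp16 = weight_gb(16)   # matches the "~14GB" figure above
q4 = weight_gb(4.5)    # 4-bit weights + scales, matches "~4GB"

print(f"fp16 weights: {fp16:.1f} GB, 4-bit weights: {q4:.1f} GB")
```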
Resume training from a checkpoint:

```bash
python train_function_calling.py \
  --model mlx_model \
  --train \
  --resume-adapter-file adapters.npz
```

Evaluate on the test split:

```bash
python train_function_calling.py \
  --model mlx_model \
  --adapter-file adapters.npz \
  --test
```

Prepare a larger dataset:

```bash
python prepare_xlam_data.py \
  --train-samples 4000 \
  --valid-samples 500 \
  --test-samples 500
```

Adjust generation settings:

```bash
python test_function_calling.py \
  --model mlx_model \
  --adapter adapters.npz \
  --max-tokens 300 \
  --temp 0.5
```

```
mlx_lora_fine_tuning/
├── README.md                  # This file
├── LICENSE                    # MIT License
├── requirements.txt           # Python dependencies
├── .env.example               # Environment configuration template
├── .gitignore                 # Git ignore rules
│
├── prepare_xlam_data.py       # Data preprocessing script
├── convert.py                 # Model conversion to MLX
├── train_function_calling.py  # Training script
├── test_function_calling.py   # Testing/inference script
├── run_pipeline.sh            # Complete automation script
│
├── models.py                  # Model architecture and LoRA
├── utils.py                   # Utility functions
│
├── data/xlam/                 # Dataset directory (created by scripts)
│   ├── train.jsonl
│   ├── valid.jsonl
│   └── test.jsonl
│
├── mlx_model/                 # Converted model (created by convert.py)
└── adapters.npz               # Trained LoRA weights
```
- Use QLoRA: add the `-q` flag to `convert.py`
- Reduce batch size: use `--batch-size 1`
- Reduce LoRA layers: use `--lora-layers 8` or `--lora-layers 4`
- Close other applications to free memory
```bash
# Option 1: Environment variable
export HF_TOKEN=your_token

# Option 2: CLI login
huggingface-cli login

# Option 3: .env file (recommended)
echo "HF_TOKEN=your_token" > .env
```

- Close other applications to free memory
- Ensure you're using QLoRA (`-q` flag in `convert.py`)
- Try a smaller batch size (`--batch-size 1`)
- Monitor Activity Monitor for memory pressure
- Increase iterations (`--iters 1000`)
- Adjust the learning rate (`--learning-rate 5e-5`)
- Increase LoRA layers (`--lora-layers 24`)
- Check that validation loss is decreasing
- Ensure good ventilation
- Use a cooling pad if available
- Stop training (Ctrl+C), let cool for 10 minutes
- Resume with `--resume-adapter-file adapters.npz.iter_XXX`
xLAM Dataset:

```
@misc{xlam2024,
  title={xLAM: A Family of Large Action Models to Empower AI Agent Systems},
  author={Salesforce Research},
  year={2024},
  publisher={Hugging Face}
}
```

LoRA:

```
@article{hu2021lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
  journal={arXiv preprint arXiv:2106.09685},
  year={2021}
}
```

QLoRA:

```
@article{dettmers2023qlora,
  title={QLoRA: Efficient Finetuning of Quantized LLMs},
  author={Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
  journal={arXiv preprint arXiv:2305.14314},
  year={2023}
}
```

This project is based on:
- Apple MLX Examples - LoRA implementation
- xLAM Dataset - Salesforce Research
- Mistral 7B - Mistral AI
MIT License - See LICENSE file for details.
This project is adapted from Apple's MLX Examples and follows the same MIT License.