
falmar/unsloth-training-exp


Fine-tuning Qwen Models with Unsloth (Experiment)

Experimental project for fine-tuning Qwen3-4B-Instruct models using Unsloth's efficient training framework for various domain-specific tasks.

Overview

This project provides a flexible pipeline for fine-tuning Qwen3-4B-Instruct models using Unsloth. It supports training on custom datasets for various specialized tasks including text analysis, data extraction, and structured output generation.

Base Model: unsloth/Qwen3-4B-Instruct-2507
Training Method: LoRA (Low-Rank Adaptation)
Output Formats: LoRA adapters, merged HuggingFace model, GGUF (Q8_0, Q4_K_M)
Inference: Local Python or Ollama

Quick Start

1. Installation

# Create virtual environment
python -m venv venv
source venv/bin/activate          # Linux/Mac
# venv\Scripts\activate            # Windows

# Install dependencies
pip install -r requirements.txt

2. Training

Option A: Full pipeline with default config

# Trains model, creates Q8_0 and Q4_K_M GGUF, sets up Ollama
./scripts/train_and_quantize.sh

Option B: Training with custom config

# Copy and edit the example config
cp configs/example.yaml configs/my-model.yaml
# Edit configs/my-model.yaml with your settings

# Train with custom config
python app/train.py --config configs/my-model.yaml

Option C: Parallel training on multiple GPUs (e.g., vast.ai)

# Train multiple models in parallel on different GPUs
./scripts/train_multi.sh configs/model1.yaml configs/model2.yaml configs/model3.yaml

What happens during training:

  1. Loads datasets from config file
  2. Fine-tunes LoRA adapters on Qwen3-4B base model
  3. Merges LoRA weights into base model
  4. Converts merged model to GGUF format (configurable)
  5. Outputs to directories specified in config

3. Inference

Option A: Local Python inference (LoRA)

# Edit query.txt with your input
echo "Your CSS ranking query here" > query.txt

# Run inference
python app/inference.py

Option B: Ollama inference (GGUF, recommended)

# Start Ollama (if not already running)
ollama serve

# In another terminal, interactive chat
ollama run qwen3-css-ranker:4b-instruct

# Or use via API
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3-css-ranker:4b-instruct",
  "prompt": "Your ranking query here"
}'
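
The same /api/generate endpoint can be called from plain Python. This is a minimal sketch using only the standard library; it assumes "stream": false so the response arrives as a single JSON object rather than a stream of chunks:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_payload(model: str, prompt: str) -> dict:
    # stream=False makes Ollama return one JSON object
    # instead of a stream of JSON lines
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "qwen3-css-ranker:4b-instruct") -> str:
    payload = json.dumps(build_generate_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires a running Ollama server):
#   answer = generate("Your ranking query here")
```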

Project Structure

configs/                             # YAML configuration files
├── example.yaml                     # Documented example config
└── default.yaml                     # Default training config

app/                                 # Python application code
├── train.py                         # Main training script (--config flag)
├── inference.py                     # Local inference with LoRA
├── utils/                           # Utility modules
│   ├── dataset_loader.py           # Dataset loading utilities
│   ├── filter_dataset.py           # Filter by token count
│   └── analyze_tokens.py           # Token analysis
└── experimental/                    # Experimental inference variants
    ├── inference_with_tools.py     # LangChain tool calling
    ├── inference_json_api.py       # JSON API interface
    ├── inference_dynamic_tools.py  # Dynamic tool names
    └── ollama_tool_server.py       # Ollama tool server

scripts/                             # Shell scripts
├── train_and_quantize.sh           # Single model training + quantization
└── train_multi.sh                  # Parallel multi-GPU training launcher

datasets/                            # Training data (JSONL format)
├── train-dataset.jsonl              # Raw training data
├── train-dataset-filtered.jsonl     # Filtered training data
├── eval-dataset.jsonl               # Raw evaluation data
└── eval-dataset-filtered.jsonl      # Filtered evaluation data

models/                              # Per-model outputs (when using custom configs)
└── my-model/
    ├── lora/                        # LoRA adapters
    ├── merged/                      # Merged HuggingFace model
    └── gguf/                        # GGUF files

runs/                                # TensorBoard logs (for monitoring)

model/                               # Default GGUF output (legacy)
lora_model/                          # Default LoRA output (legacy)
model_merged/                        # Default merged output (legacy)

Utilities

Filter Dataset by Token Count

python app/utils/filter_dataset.py

Filters training and evaluation datasets to a maximum of 24,576 tokens per example, creating filtered versions suitable for training.
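
For reference, the filtering step amounts to something like the sketch below. The real filter_dataset.py presumably counts tokens with the model's tokenizer; the whitespace-based count_tokens here is a simplified stand-in:

```python
import json

MAX_TOKENS = 24_576  # the per-example limit described above

def count_tokens(example: dict) -> int:
    # Stand-in for the model tokenizer: rough whitespace token count
    text = " ".join(turn["content"] for turn in example["conversations"])
    return len(text.split())

def filter_jsonl(in_path: str, out_path: str, max_tokens: int = MAX_TOKENS) -> int:
    """Copy examples at or under the token limit; return how many were kept."""
    kept = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            example = json.loads(line)
            if count_tokens(example) <= max_tokens:
                fout.write(line)
                kept += 1
    return kept
```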

Analyze Token Distribution

python app/utils/analyze_tokens.py

Shows token length statistics for your dataset:

  • Min/max/average/median token lengths
  • Standard deviation
  • Distribution across token ranges
  • Padding overhead analysis
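
The same statistics can be reproduced with Python's standard statistics module. A sketch, assuming per-example token lengths have already been computed:

```python
import statistics

def token_stats(lengths: list[int]) -> dict:
    """Summarize token lengths as analyze_tokens.py reports them."""
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
    }

def padding_overhead(lengths: list[int]) -> float:
    """Fraction of tokens wasted if every example is padded to the longest one."""
    longest = max(lengths)
    return 1 - sum(lengths) / (longest * len(lengths))
```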

Dataset Format

Training datasets use JSONL format with conversation structure. Each line is a JSON object:

{
  "conversations": [
    {
      "role": "system",
      "content": "You are a helpful assistant..."
    },
    {
      "role": "user",
      "content": "Your task instruction here..."
    },
    {
      "role": "assistant",
      "content": "The response based on the task..."
    }
  ]
}

Supported Input Formats

The dataset_loader.py supports multiple JSONL formats:

  1. Conversations format (recommended): {"conversations": [{role, content}, ...]}
  2. OpenAI messages format: {"messages": [{role, content}, ...]}
  3. Simple Q&A: {"user": "...", "assistant": "..."}
  4. Input-Output: {"input": "...", "output": "..."}
  5. Instruction-Response: {"instruction": "...", "response": "..."}
  6. Prompt-Completion: {"prompt": "...", "completion": "..."}

All formats are automatically converted to standard conversation format during loading.
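
The conversion amounts to something like the following sketch (field names are taken from the list above; the actual dataset_loader.py implementation may differ):

```python
def to_conversations(example: dict) -> dict:
    """Normalize any supported JSONL record to the conversations format.

    Hypothetical sketch of what dataset_loader.py does.
    """
    if "conversations" in example:
        return example
    if "messages" in example:
        return {"conversations": example["messages"]}
    # Two-field formats: map each (user_key, assistant_key) pair
    pairs = [("user", "assistant"), ("input", "output"),
             ("instruction", "response"), ("prompt", "completion")]
    for user_key, assistant_key in pairs:
        if user_key in example and assistant_key in example:
            return {"conversations": [
                {"role": "user", "content": example[user_key]},
                {"role": "assistant", "content": example[assistant_key]},
            ]}
    raise ValueError(f"Unrecognized example format: {sorted(example)}")
```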

Model Variants and Sizes

| Format       | Size   | Location                        | Use Case                      | Speed          |
|--------------|--------|---------------------------------|-------------------------------|----------------|
| LoRA Adapter | ~100MB | lora_model/                     | Fine-tuning, further training | N/A            |
| Merged HF    | ~7.5GB | model_merged/                   | GGUF conversion source        | N/A            |
| GGUF Q8_0    | ~4.0GB | model/model-trained.Q8_0.gguf   | High quality, minimal loss    | ~10 tokens/sec |
| GGUF Q4_K_M  | ~2.4GB | model/model-trained.Q4_K_M.gguf | Smaller, faster, good quality | ~20 tokens/sec |

Switch between GGUF variants:

# Update Modelfile to use Q4_K_M instead of Q8_0
sed -i 's|model-trained.Q8_0|model-trained.Q4_K_M|' Modelfile

# Recreate Ollama model
ollama rm qwen3-css-ranker:4b-instruct
ollama create qwen3-css-ranker:4b-instruct -f Modelfile

# Now use the new model
ollama run qwen3-css-ranker:4b-instruct

Configuration

Training is configured via YAML files in the configs/ directory. See configs/example.yaml for a fully documented template.

Config File Structure

# Model settings
model:
  name: "unsloth/Qwen3-4B-Instruct-2507"
  max_seq_length: 24576
  load_in_4bit: true

# LoRA settings
lora:
  rank: 16
  alpha: 32
  dropout: 0
  target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]

# Dataset paths
dataset:
  train_file: "./datasets/train-dataset-filtered.jsonl"
  eval_file: "./datasets/eval-dataset-filtered.jsonl"

# Training hyperparameters
training:
  batch_size: 1
  gradient_accumulation_steps: 2
  learning_rate: 1e-4
  num_epochs: 5              # Use num_epochs OR max_steps
  # max_steps: 500
  warmup_steps: 30
  weight_decay: 0.01
  seed: 3407

# Output directories
output:
  lora_dir: "./models/my-model/lora"
  merged_dir: "./models/my-model/merged"
  gguf_dir: "./models/my-model/gguf"

# GGUF export settings
export:
  gguf: true                 # Set to false to skip GGUF conversion
  quantizations: [q8_0]      # Quantization types to create

# Logging (TensorBoard)
logging:
  report_to: "tensorboard"   # Options: none, tensorboard, wandb
  logging_steps: 1
  tensorboard_dir: "./runs"
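
train.py presumably overlays your config on top of the defaults. The recursive merge below illustrates how nested sections like model: or training: would combine; it is illustrative only, not the actual implementation:

```python
def merge_config(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay a user config on top of defaults.

    Illustrative sketch; the merge logic in app/train.py may differ.
    """
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)  # deep-merge sections
        else:
            merged[key] = value  # scalar or new key: user value wins
    return merged
```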

Inference Parameters (in Modelfile)

PARAMETER temperature 0.7   # Randomness (0=deterministic, 1=creative)
PARAMETER top_p 0.8         # Nucleus sampling threshold
PARAMETER top_k 20          # Top-k sampling
PARAMETER num_ctx 24576     # Context window size

Monitoring Training with TensorBoard

When report_to: "tensorboard" is set in your config, training logs are saved to ./runs/.

View training progress:

# Start TensorBoard
tensorboard --logdir=./runs --port=6006

# Open http://localhost:6006 in your browser

For vast.ai: forward port 6006 over SSH, or use vast.ai's built-in port mapping, to reach TensorBoard remotely.

The parallel training launcher (train_multi.sh) automatically starts TensorBoard for you.

Troubleshooting

Out of Memory (OOM) During Training

Edit your config file to reduce memory usage:

# In your config.yaml, reduce these values:
model:
  max_seq_length: 16384      # Reduce from 24576

training:
  batch_size: 1
  gradient_accumulation_steps: 1  # Reduce from 2

llama.cpp Not Found

Unsloth downloads and compiles llama.cpp automatically on first run. If you encounter issues:

# Clear cache and rebuild
rm -rf unsloth_compiled_cache/
python app/train.py --config configs/default.yaml  # Will recompile

GGUF Conversion Fails

  1. Ensure model_merged/ exists with full model weights
  2. Check that llama.cpp was compiled correctly
  3. Run training again from scratch if needed

Import Errors

If you get ModuleNotFoundError: No module named 'app', run from the project root:

# Make sure you're in the project root directory
cd /path/to/unsloth-html-training
python app/train.py          # Correct
# NOT: cd app && python train.py  # Wrong

Advanced Usage

Training with Custom Dataset

  1. Prepare your dataset in JSONL format (any supported format)
  2. Place in datasets/ directory
  3. Create a config file with your dataset paths:
    dataset:
      train_file: "./datasets/my-train-data.jsonl"
      eval_file: "./datasets/my-eval-data.jsonl"
  4. Run training: python app/train.py --config configs/my-config.yaml
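
Step 1 can be scripted. A small sketch that writes examples in the conversations format described in the Dataset Format section:

```python
import json

def write_jsonl(examples: list[dict], path: str) -> None:
    """Write one JSON object per line, as the files in datasets/ expect."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")

example = {"conversations": [
    {"role": "system", "content": "You are a helpful assistant..."},
    {"role": "user", "content": "Your task instruction here..."},
    {"role": "assistant", "content": "The response based on the task..."},
]}

# Usage:
#   write_jsonl([example], "./datasets/my-train-data.jsonl")
```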

Parallel Multi-GPU Training (vast.ai)

Train multiple models simultaneously on a multi-GPU machine:

# Create separate configs for each model
cp configs/example.yaml configs/model1.yaml
cp configs/example.yaml configs/model2.yaml
cp configs/example.yaml configs/model3.yaml

# Edit each config with different datasets/outputs
# ...

# Launch all on separate GPUs (4x 4090 example)
./scripts/train_multi.sh configs/model1.yaml configs/model2.yaml configs/model3.yaml configs/model4.yaml

# Monitor all training runs in TensorBoard
# http://localhost:6006

Further Fine-tuning

Train on top of an existing trained model:

# In app/train.py, change the model load to:
from peft import PeftModel, AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen3-4B-Instruct-2507",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base_model, "./lora_model")

Experimental Inference Variants

For advanced use cases, try the experimental inference scripts:

# LangChain tool calling
python app/experimental/inference_with_tools.py

# JSON API interface
python app/experimental/inference_json_api.py

# Dynamic tool naming (TypeScript frameworks)
python app/experimental/inference_dynamic_tools.py

# Ollama tool server (subprocess callable)
python app/experimental/ollama_tool_server.py <tool_name> [query_file]

Performance Tips

  1. Use Q4_K_M for inference - 2.4GB model with good quality (60% smaller than Q8_0)
  2. Filter your dataset - Run python app/utils/filter_dataset.py before training
  3. Analyze token distribution - Use python app/utils/analyze_tokens.py to optimize settings
  4. Adjust LoRA rank - A lower rank (8) trains faster and uses less memory; a higher rank (32) gives the adapter more capacity
  5. Use gradient accumulation - Simulates larger batch size without more GPU memory
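
Tip 5 in numbers: the batch size the optimizer actually sees is the per-device batch size multiplied by the accumulation steps (and by the number of GPUs when training data-parallel), while peak activation memory only scales with the per-device batch size:

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int,
                         num_gpus: int = 1) -> int:
    """Examples contributing to each optimizer step:
    per-device batch x accumulation steps x devices."""
    return batch_size * grad_accum_steps * num_gpus

# With the default config above (batch_size=1, gradient_accumulation_steps=2),
# each optimizer step sees 2 examples without extra activation memory.
```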

System Requirements

Minimum:

  • GPU with 12GB+ VRAM (e.g., RTX 3060, RTX 4060 Ti, A40)
  • 50GB storage (models + datasets)
  • Python 3.10+

Recommended:

  • GPU with 16GB+ VRAM (RTX 4070 or better)
  • 100GB storage (for multiple quantizations)
  • 32GB+ system RAM

Workflow Summary

  1. Prepare data: JSONL format with conversations
  2. Filter data: python app/utils/filter_dataset.py
  3. Analyze tokens: python app/utils/analyze_tokens.py
  4. Train: ./scripts/train_and_quantize.sh (full pipeline)
  5. Infer: ollama run qwen3-css-ranker:4b-instruct

License

This project uses:

  • Unsloth: LGPL-3.0 license
  • Qwen3 Model: Apache 2.0 license
  • Project code: MIT (or your chosen license)

Getting Help

If you encounter issues:

  1. Check the Troubleshooting section above
  2. Check the training logs for specific errors
  3. Ensure your dataset is in the correct format

Workflow for Non-Python Users

The restructured project is designed to be simple:

  1. Setup (one time):

    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt
  2. Prepare data:

    • Place JSONL files in datasets/
    • Run: python app/utils/filter_dataset.py
  3. Configure (edit YAML, not Python):

    • Copy: cp configs/example.yaml configs/my-model.yaml
    • Edit configs/my-model.yaml with your dataset paths and settings
  4. Train:

    # With default config
    ./scripts/train_and_quantize.sh
    
    # Or with custom config
    ./scripts/train_and_quantize.sh configs/my-model.yaml
  5. Use trained model:

    ollama run qwen3-css-ranker:4b-instruct
    # Type your queries, model responds

All Python complexity is hidden. You only need to:

  • Edit YAML config files (not Python code)
  • Run shell scripts
  • Check outputs in the directories specified in your config
