Experimental project for fine-tuning Qwen3-4B-Instruct models using Unsloth's efficient training framework for various domain-specific tasks.
This project provides a flexible pipeline for fine-tuning Qwen3-4B-Instruct models using Unsloth. It supports training on custom datasets for various specialized tasks including text analysis, data extraction, and structured output generation.
Base Model: unsloth/Qwen3-4B-Instruct-2507
Training Method: LoRA (Low-Rank Adaptation)
Output Formats: LoRA adapters, merged HuggingFace model, GGUF (Q8_0, Q4_K_M)
Inference: Local Python or Ollama
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt

Option A: Full pipeline with default config
# Trains model, creates Q8_0 and Q4_K_M GGUF, sets up Ollama
./scripts/train_and_quantize.sh

Option B: Training with custom config
# Copy and edit the example config
cp configs/example.yaml configs/my-model.yaml
# Edit configs/my-model.yaml with your settings
# Train with custom config
python app/train.py --config configs/my-model.yaml

Option C: Parallel training on multiple GPUs (e.g., vast.ai)
# Train multiple models in parallel on different GPUs
./scripts/train_multi.sh configs/model1.yaml configs/model2.yaml configs/model3.yaml

What happens during training:
- Loads datasets from config file
- Fine-tunes LoRA adapters on Qwen3-4B base model
- Merges LoRA weights into base model
- Converts merged model to GGUF format (configurable)
- Outputs to directories specified in config
Option A: Local Python inference (LoRA)
# Edit query.txt with your input
echo "Your CSS ranking query here" > query.txt
# Run inference
python app/inference.py

Option B: Ollama inference (GGUF, recommended)
# Start Ollama (if not already running)
ollama serve
# In another terminal, interactive chat
ollama run qwen3-css-ranker:4b-instruct
# Or use via API
curl http://localhost:11434/api/generate -d '{
"model": "qwen3-css-ranker:4b-instruct",
"prompt": "Your ranking query here"
}'

configs/ # YAML configuration files
├── example.yaml # Documented example config
└── default.yaml # Default training config
app/ # Python application code
├── train.py # Main training script (--config flag)
├── inference.py # Local inference with LoRA
├── utils/ # Utility modules
│ ├── dataset_loader.py # Dataset loading utilities
│ ├── filter_dataset.py # Filter by token count
│ └── analyze_tokens.py # Token analysis
└── experimental/ # Experimental inference variants
├── inference_with_tools.py # LangChain tool calling
├── inference_json_api.py # JSON API interface
├── inference_dynamic_tools.py # Dynamic tool names
└── ollama_tool_server.py # Ollama tool server
scripts/ # Shell scripts
├── train_and_quantize.sh # Single model training + quantization
└── train_multi.sh # Parallel multi-GPU training launcher
datasets/ # Training data (JSONL format)
├── train-dataset.jsonl # Raw training data
├── train-dataset-filtered.jsonl # Filtered training data
├── eval-dataset.jsonl # Raw evaluation data
└── eval-dataset-filtered.jsonl # Filtered evaluation data
models/ # Per-model outputs (when using custom configs)
├── my-model/
│ ├── lora/ # LoRA adapters
│ ├── merged/ # Merged HuggingFace model
│ └── gguf/ # GGUF files
runs/ # TensorBoard logs (for monitoring)
model/ # Default GGUF output (legacy)
lora_model/ # Default LoRA output (legacy)
model_merged/ # Default merged output (legacy)
python app/utils/filter_dataset.py

Filters training and evaluation datasets to a maximum of 24,576 tokens per example, creating filtered versions suitable for training.
python app/utils/analyze_tokens.py

Shows token length statistics for your dataset:
- Min/max/average/median token lengths
- Standard deviation
- Distribution across token ranges
- Padding overhead analysis
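The core of such an analysis can be sketched as below. This is a hypothetical helper, not the repo's actual script: `analyze_tokens.py` would first derive per-example token counts with the model's tokenizer, then summarize them roughly like this.

```python
import statistics


def token_stats(token_counts: list[int]) -> dict:
    """Summarize per-example token lengths (counts come from the tokenizer)."""
    return {
        "min": min(token_counts),
        "max": max(token_counts),
        "mean": statistics.mean(token_counts),
        "median": statistics.median(token_counts),
        "stdev": statistics.stdev(token_counts) if len(token_counts) > 1 else 0.0,
        # Padding overhead: fraction of tokens wasted if every example
        # were padded to the longest example in the dataset
        "padding_overhead": 1 - sum(token_counts) / (max(token_counts) * len(token_counts)),
    }
```

A dataset with high padding overhead is a good candidate for filtering or a smaller `max_seq_length`.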
Training datasets use JSONL format with conversation structure. Each line is a JSON object:
{
  "conversations": [
    {
      "role": "system",
      "content": "You are a helpful assistant..."
    },
    {
      "role": "user",
      "content": "Your task instruction here..."
    },
    {
      "role": "assistant",
      "content": "The response based on the task..."
    }
  ]
}

The dataset_loader.py supports multiple JSONL formats:
- Conversations format (recommended): {"conversations": [{role, content}, ...]}
- OpenAI messages format: {"messages": [{role, content}, ...]}
- Simple Q&A: {"user": "...", "assistant": "..."}
- Input-Output: {"input": "...", "output": "..."}
- Instruction-Response: {"instruction": "...", "response": "..."}
- Prompt-Completion: {"prompt": "...", "completion": "..."}
All formats are automatically converted to standard conversation format during loading.
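As a rough sketch of how this normalization can work — the actual logic lives in app/utils/dataset_loader.py; the function below is a hypothetical stand-in:

```python
def to_conversations(record: dict) -> dict:
    """Normalize one JSONL record into the standard conversations format."""
    if "conversations" in record:
        return record
    if "messages" in record:
        return {"conversations": record["messages"]}
    # Map the two-field formats onto user/assistant roles
    for user_key, assistant_key in [
        ("user", "assistant"),
        ("input", "output"),
        ("instruction", "response"),
        ("prompt", "completion"),
    ]:
        if user_key in record and assistant_key in record:
            return {"conversations": [
                {"role": "user", "content": record[user_key]},
                {"role": "assistant", "content": record[assistant_key]},
            ]}
    raise ValueError(f"Unrecognized record format: {sorted(record.keys())}")
```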
| Format | Size | Location | Use Case | Speed |
|---|---|---|---|---|
| LoRA Adapter | ~100MB | lora_model/ | Fine-tuning, further training | N/A |
| Merged HF | ~7.5GB | model_merged/ | GGUF conversion source | N/A |
| GGUF Q8_0 | ~4.0GB | model/model-trained.Q8_0.gguf | High quality, minimal loss | ~10 tokens/sec |
| GGUF Q4_K_M | ~2.4GB | model/model-trained.Q4_K_M.gguf | Smaller, faster, good quality | ~20 tokens/sec |
Switch between GGUF variants:
# Update Modelfile to use Q4_K_M instead of Q8_0
sed -i 's|model-trained.Q8_0|model-trained.Q4_K_M|' Modelfile
# Recreate Ollama model
ollama rm qwen3-css-ranker:4b-instruct
ollama create qwen3-css-ranker:4b-instruct -f Modelfile
# Now use the new model
ollama run qwen3-css-ranker:4b-instruct

Training is configured via YAML files in the configs/ directory. See configs/example.yaml for a fully documented template.
# Model settings
model:
  name: "unsloth/Qwen3-4B-Instruct-2507"
  max_seq_length: 24576
  load_in_4bit: true

# LoRA settings
lora:
  rank: 16
  alpha: 32
  dropout: 0
  target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]

# Dataset paths
dataset:
  train_file: "./datasets/train-dataset-filtered.jsonl"
  eval_file: "./datasets/eval-dataset-filtered.jsonl"

# Training hyperparameters
training:
  batch_size: 1
  gradient_accumulation_steps: 2
  learning_rate: 1e-4
  num_epochs: 5 # Use num_epochs OR max_steps
  # max_steps: 500
  warmup_steps: 30
  weight_decay: 0.01
  seed: 3407

# Output directories
output:
  lora_dir: "./models/my-model/lora"
  merged_dir: "./models/my-model/merged"
  gguf_dir: "./models/my-model/gguf"

# GGUF export settings
export:
  gguf: true # Set to false to skip GGUF conversion
  quantizations: [q8_0] # Quantization types to create

# Logging (TensorBoard)
logging:
  report_to: "tensorboard" # Options: none, tensorboard, wandb
  logging_steps: 1
  tensorboard_dir: "./runs"

Ollama Modelfile sampling parameters:

temperature 0.7 # Randomness (0=deterministic, 1=creative)
top_p 0.8 # Nucleus sampling threshold
top_k 20 # Top-k sampling
num_ctx 24576 # Context window size
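These sampling parameters belong in the Ollama Modelfile alongside the GGUF reference. A minimal sketch, assuming the default GGUF output path from the table above:

```
FROM ./model/model-trained.Q8_0.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER top_k 20
PARAMETER num_ctx 24576
```

Recreate the Ollama model after editing the Modelfile for the changes to take effect.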
When report_to: "tensorboard" is set in your config, training logs are saved to ./runs/.
View training progress:
# Start TensorBoard
tensorboard --logdir=./runs --port=6006
# Open http://localhost:6006 in your browser

For vast.ai: Use port forwarding to access TensorBoard remotely, or the built-in vast.ai port forwarding feature.
The parallel training launcher (train_multi.sh) automatically starts TensorBoard for you.
Edit your config file to reduce memory usage:
# In your config.yaml, reduce these values:
model:
  max_seq_length: 16384 # Reduce from 24576
training:
  batch_size: 1
  gradient_accumulation_steps: 1 # Reduce from 2

Unsloth downloads and compiles llama.cpp automatically on first run. If you encounter issues:
# Clear cache and rebuild
rm -rf unsloth_compiled_cache/
python app/train.py --config configs/default.yaml # Will recompile

- Ensure model_merged/ exists with full model weights
- Check that llama.cpp was compiled correctly
- Run training again from scratch if needed
If you get ModuleNotFoundError: No module named 'app', run from the project root:
# Make sure you're in the project root directory
cd /path/to/unsloth-html-training
python app/train.py # Correct
# NOT: cd app && python train.py # Wrong

- Prepare your dataset in JSONL format (any supported format)
- Place in datasets/ directory
- Create a config file with your dataset paths:

  dataset:
    train_file: "./datasets/my-train-data.jsonl"
    eval_file: "./datasets/my-eval-data.jsonl"

- Run training: python app/train.py --config configs/my-config.yaml
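Before launching a run, a quick sanity check of the JSONL file catches malformed lines early. This is a hypothetical helper, not part of the repo; it only checks for the key sets the loader recognizes:

```python
import json

# Key sets accepted by the supported JSONL formats
RECOGNIZED_KEYS = [
    {"conversations"}, {"messages"},
    {"user", "assistant"}, {"input", "output"},
    {"instruction", "response"}, {"prompt", "completion"},
]


def validate_jsonl(path: str) -> int:
    """Return the number of valid examples; raise on the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)  # raises on invalid JSON
            if not any(keys <= record.keys() for keys in RECOGNIZED_KEYS):
                raise ValueError(f"line {lineno}: unrecognized keys {sorted(record.keys())}")
            count += 1
    return count
```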
Train multiple models simultaneously on a multi-GPU machine:
# Create separate configs for each model
cp configs/example.yaml configs/model1.yaml
cp configs/example.yaml configs/model2.yaml
cp configs/example.yaml configs/model3.yaml
# Edit each config with different datasets/outputs
# ...
# Launch all on separate GPUs (4x 4090 example)
./scripts/train_multi.sh configs/model1.yaml configs/model2.yaml configs/model3.yaml configs/model4.yaml
# Monitor all training runs in TensorBoard
# http://localhost:6006

Train on top of an existing trained model:
# In app/train.py, change the model load to:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen3-4B-Instruct-2507",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base_model, "./lora_model")

For advanced use cases, try the experimental inference scripts:
# LangChain tool calling
python app/experimental/inference_with_tools.py
# JSON API interface
python app/experimental/inference_json_api.py
# Dynamic tool naming (TypeScript frameworks)
python app/experimental/inference_dynamic_tools.py
# Ollama tool server (subprocess callable)
python app/experimental/ollama_tool_server.py <tool_name> [query_file]

- Use Q4_K_M for inference - 2.4GB model with good quality (60% smaller than Q8_0)
- Filter your dataset - Run python app/utils/filter_dataset.py before training
- Analyze token distribution - Use python app/utils/analyze_tokens.py to optimize settings
- Adjust LoRA rank - Lower rank (8) trains faster, higher rank (32) is more flexible
- Use gradient accumulation - Simulates a larger batch size without more GPU memory
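For example, with the default config's batch_size of 1 and gradient_accumulation_steps of 2, the optimizer sees an effective batch of 2 while peak memory is driven only by the per-device batch size:

```python
# Effective batch size is the product of per-device batch size,
# accumulation steps, and (on multi-GPU setups) the number of devices.
per_device_batch_size = 1
gradient_accumulation_steps = 2
num_devices = 1

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 2
```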
Minimum:
- GPU with 12GB VRAM (RTX 3060, RTX 4060Ti, A40)
- 50GB storage (models + datasets)
- Python 3.10+
Recommended:
- GPU with 16GB+ VRAM (RTX 4070 or better)
- 100GB storage (for multiple quantizations)
- 32GB+ system RAM
- Prepare data: JSONL format with conversations
- Filter data: python app/utils/filter_dataset.py
- Analyze tokens: python app/utils/analyze_tokens.py
- Train: ./scripts/train_and_quantize.sh (full pipeline)
- Infer: ollama run qwen3-css-ranker:4b-instruct
This project uses:
- Unsloth: LGPL-3.0 license
- Qwen3 Model: Apache 2.0 license
- Project code: MIT (or your chosen license)
If you encounter issues:
- Check the Troubleshooting section above
- Review the plan file: /path/to/.claude/plans/parsed-kindling-church.md
- Check logs from training for specific errors
- Ensure your dataset is in the correct format
The restructured project is designed to be simple:
- Setup (one time):

  python -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt

- Prepare data:
  - Place JSONL files in datasets/
  - Run: python app/utils/filter_dataset.py

- Configure (edit YAML, not Python):
  - Copy: cp configs/example.yaml configs/my-model.yaml
  - Edit configs/my-model.yaml with your dataset paths and settings

- Train:

  # With default config
  ./scripts/train_and_quantize.sh
  # Or with custom config
  ./scripts/train_and_quantize.sh configs/my-model.yaml

- Use trained model:

  ollama run qwen3-css-ranker:4b-instruct
  # Type your queries, model responds
All Python complexity is hidden. You only need to:
- Edit YAML config files (not Python code)
- Run shell scripts
- Check outputs in the directories specified in your config