Experimental project for fine-tuning Qwen3-4B-Instruct models using Unsloth's efficient training framework for various domain-specific tasks.
This project provides a flexible pipeline for fine-tuning Qwen3-4B-Instruct models using Unsloth. It supports training on custom datasets for various specialized tasks including text analysis, data extraction, and structured output generation.
Base Model: unsloth/Qwen3-4B-Instruct-2507
Training Method: LoRA (Low-Rank Adaptation)
Output Formats: LoRA adapters, merged HuggingFace model, GGUF (Q8_0, Q4_K_M)
Inference: Local Python or Ollama
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
# venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt

Option A: Full pipeline with default config
# Trains model, creates Q8_0 and Q4_K_M GGUF, sets up Ollama
./scripts/train_and_quantize.sh

Option B: Training with custom config
# Copy and edit the example config
cp configs/example.yaml configs/my-model.yaml
# Edit configs/my-model.yaml with your settings
# Train with custom config
python app/train.py --config configs/my-model.yaml

Option C: Parallel training on multiple GPUs (e.g., vast.ai)
# Train multiple models in parallel on different GPUs
./scripts/train_multi.sh configs/model1.yaml configs/model2.yaml configs/model3.yaml

What happens during training:
- Loads datasets from config file
- Fine-tunes LoRA adapters on Qwen3-4B base model
- Merges LoRA weights into base model
- Converts merged model to GGUF format (configurable)
- Outputs to directories specified in config
Option A: Local Python inference (LoRA)
# Edit query.txt with your input
echo "Your CSS ranking query here" > query.txt
# Run inference
python app/inference.py

Option B: Ollama inference (GGUF, recommended)
# Start Ollama (if not already running)
ollama serve
# In another terminal, interactive chat
ollama run qwen3-css-ranker:4b-instruct
# Or use via API
curl http://localhost:11434/api/generate -d '{
"model": "qwen3-css-ranker:4b-instruct",
"prompt": "Your ranking query here"
}'

configs/ # YAML configuration files
├── example.yaml # Documented example config
└── default.yaml # Default training config
app/ # Python application code
├── train.py # Main training script (--config flag)
├── inference.py # Local inference with LoRA
├── utils/ # Utility modules
│ ├── dataset_loader.py # Dataset loading utilities
│ ├── filter_dataset.py # Filter by token count
│ └── analyze_tokens.py # Token analysis
└── experimental/ # Experimental inference variants
├── inference_with_tools.py # LangChain tool calling
├── inference_json_api.py # JSON API interface
├── inference_dynamic_tools.py # Dynamic tool names
└── ollama_tool_server.py # Ollama tool server
scripts/ # Shell scripts
├── train_and_quantize.sh # Single model training + quantization
└── train_multi.sh # Parallel multi-GPU training launcher
datasets/ # Training data (JSONL format)
├── train-dataset.jsonl # Raw training data
├── train-dataset-filtered.jsonl # Filtered training data
├── eval-dataset.jsonl # Raw evaluation data
└── eval-dataset-filtered.jsonl # Filtered evaluation data
models/ # Per-model outputs (when using custom configs)
├── my-model/
│ ├── lora/ # LoRA adapters
│ ├── merged/ # Merged HuggingFace model
│ └── gguf/ # GGUF files
runs/ # TensorBoard logs (for monitoring)
model/ # Default GGUF output (legacy)
lora_model/ # Default LoRA output (legacy)
model_merged/ # Default merged output (legacy)
python app/utils/filter_dataset.py

Filters training and evaluation datasets to a maximum of 24,576 tokens per example, creating filtered versions suitable for training.
python app/utils/analyze_tokens.py

Shows token length statistics for your dataset:
- Min/max/average/median token lengths
- Standard deviation
- Distribution across token ranges
- Padding overhead analysis
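The core of such an analysis can be sketched as below. This is a hypothetical helper, not the repo's actual script: `analyze_tokens.py` would first derive per-example token counts with the model's tokenizer, then summarize them roughly like this.

```python
import statistics


def token_stats(token_counts: list[int]) -> dict:
    """Summarize per-example token lengths (counts come from the tokenizer)."""
    return {
        "min": min(token_counts),
        "max": max(token_counts),
        "mean": statistics.mean(token_counts),
        "median": statistics.median(token_counts),
        "stdev": statistics.stdev(token_counts) if len(token_counts) > 1 else 0.0,
        # Padding overhead: fraction of tokens wasted if every example
        # were padded to the longest example in the dataset
        "padding_overhead": 1 - sum(token_counts) / (max(token_counts) * len(token_counts)),
    }
```

A dataset with high padding overhead is a good candidate for filtering or a smaller `max_seq_length`.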
Training datasets use JSONL format with conversation structure. Each line is a JSON object:
{
  "conversations": [
    {
      "role": "system",
      "content": "You are a helpful assistant..."
    },
    {
      "role": "user",
      "content": "Your task instruction here..."
    },
    {
      "role": "assistant",
      "content": "The response based on the task..."
    }
  ]
}

The dataset_loader.py supports multiple JSONL formats:
- Conversations format (recommended): {"conversations": [{role, content}, ...]}
- OpenAI messages format: {"messages": [{role, content}, ...]}
- Simple Q&A: {"user": "...", "assistant": "..."}
- Input-Output: {"input": "...", "output": "..."}
- Instruction-Response: {"instruction": "...", "response": "..."}
- Prompt-Completion: {"prompt": "...", "completion": "..."}
All formats are automatically converted to standard conversation format during loading.
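As a rough sketch of how this normalization can work — the actual logic lives in app/utils/dataset_loader.py; the function below is a hypothetical stand-in:

```python
def to_conversations(record: dict) -> dict:
    """Normalize one JSONL record into the standard conversations format."""
    if "conversations" in record:
        return record
    if "messages" in record:
        return {"conversations": record["messages"]}
    # Map the two-field formats onto user/assistant roles
    for user_key, assistant_key in [
        ("user", "assistant"),
        ("input", "output"),
        ("instruction", "response"),
        ("prompt", "completion"),
    ]:
        if user_key in record and assistant_key in record:
            return {"conversations": [
                {"role": "user", "content": record[user_key]},
                {"role": "assistant", "content": record[assistant_key]},
            ]}
    raise ValueError(f"Unrecognized record format: {sorted(record.keys())}")
```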
| Format | Size | Location | Use Case | Speed |
|---|---|---|---|---|
| LoRA Adapter | ~100MB | lora_model/ | Fine-tuning, further training | N/A |
| Merged HF | ~7.5GB | model_merged/ | GGUF conversion source | N/A |
| GGUF Q8_0 | ~4.0GB | model/model-trained.Q8_0.gguf | High quality, minimal loss | ~10 tokens/sec |
| GGUF Q4_K_M | ~2.4GB | model/model-trained.Q4_K_M.gguf | Smaller, faster, good quality | ~20 tokens/sec |
Switch between GGUF variants:
# Update Modelfile to use Q4_K_M instead of Q8_0
sed -i 's|model-trained.Q8_0|model-trained.Q4_K_M|' Modelfile
# Recreate Ollama model
ollama rm qwen3-css-ranker:4b-instruct
ollama create qwen3-css-ranker:4b-instruct -f Modelfile
# Now use the new model
ollama run qwen3-css-ranker:4b-instruct

Training is configured via YAML files in the configs/ directory. See configs/example.yaml for a fully documented template.
# Model settings
model:
  name: "unsloth/Qwen3-4B-Instruct-2507"
  max_seq_length: 24576
  load_in_4bit: true

# LoRA settings
lora:
  rank: 16
  alpha: 32
  dropout: 0
  target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]

# Dataset paths
dataset:
  train_file: "./datasets/train-dataset-filtered.jsonl"
  eval_file: "./datasets/eval-dataset-filtered.jsonl"

# Training hyperparameters
training:
  batch_size: 1
  gradient_accumulation_steps: 2
  learning_rate: 1e-4
  num_epochs: 5 # Use num_epochs OR max_steps
  # max_steps: 500
  warmup_steps: 30
  weight_decay: 0.01
  seed: 3407

# Output directories
output:
  lora_dir: "./models/my-model/lora"
  merged_dir: "./models/my-model/merged"
  gguf_dir: "./models/my-model/gguf"

# GGUF export settings
export:
  gguf: true # Set to false to skip GGUF conversion
  quantizations: [q8_0] # Quantization types to create

# Logging (TensorBoard)
logging:
  report_to: "tensorboard" # Options: none, tensorboard, wandb
  logging_steps: 1
  tensorboard_dir: "./runs"

Ollama Modelfile sampling parameters:

temperature 0.7 # Randomness (0=deterministic, 1=creative)
top_p 0.8 # Nucleus sampling threshold
top_k 20 # Top-k sampling
num_ctx 24576 # Context window size
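These sampling parameters belong in the Ollama Modelfile alongside the GGUF reference. A minimal sketch, assuming the default GGUF output path from the table above:

```
FROM ./model/model-trained.Q8_0.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER top_k 20
PARAMETER num_ctx 24576
```

Recreate the Ollama model after editing the Modelfile for the changes to take effect.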
When report_to: "tensorboard" is set in your config, training logs are saved to ./runs/.
View training progress:
# Start TensorBoard
tensorboard --logdir=./runs --port=6006
# Open http://localhost:6006 in your browser

For vast.ai: Use port forwarding to access TensorBoard remotely, or the built-in vast.ai port forwarding feature.
The parallel training launcher (train_multi.sh) automatically starts TensorBoard for you.
Edit your config file to reduce memory usage:
# In your config.yaml, reduce these values:
model:
  max_seq_length: 16384 # Reduce from 24576
training:
  batch_size: 1
  gradient_accumulation_steps: 1 # Reduce from 2

Unsloth downloads and compiles llama.cpp automatically on first run. If you encounter issues:
# Clear cache and rebuild
rm -rf unsloth_compiled_cache/
python app/train.py --config configs/default.yaml # Will recompile

- Ensure model_merged/ exists with full model weights
- Check that llama.cpp was compiled correctly
- Run training again from scratch if needed
If you get ModuleNotFoundError: No module named 'app', run from the project root:
# Make sure you're in the project root directory
cd /path/to/unsloth-html-training
python app/train.py # Correct
# NOT: cd app && python train.py # Wrong

- Prepare your dataset in JSONL format (any supported format)
- Place in datasets/ directory
- Create a config file with your dataset paths:

  dataset:
    train_file: "./datasets/my-train-data.jsonl"
    eval_file: "./datasets/my-eval-data.jsonl"

- Run training: python app/train.py --config configs/my-config.yaml
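Before launching a run, a quick sanity check of the JSONL file catches malformed lines early. This is a hypothetical helper, not part of the repo; it only checks for the key sets the loader recognizes:

```python
import json

# Key sets accepted by the supported JSONL formats
RECOGNIZED_KEYS = [
    {"conversations"}, {"messages"},
    {"user", "assistant"}, {"input", "output"},
    {"instruction", "response"}, {"prompt", "completion"},
]


def validate_jsonl(path: str) -> int:
    """Return the number of valid examples; raise on the first bad line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)  # raises on invalid JSON
            if not any(keys <= record.keys() for keys in RECOGNIZED_KEYS):
                raise ValueError(f"line {lineno}: unrecognized keys {sorted(record.keys())}")
            count += 1
    return count
```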
Train multiple models simultaneously on a multi-GPU machine:
# Create separate configs for each model
cp configs/example.yaml configs/model1.yaml
cp configs/example.yaml configs/model2.yaml
cp configs/example.yaml configs/model3.yaml
# Edit each config with different datasets/outputs
# ...
# Launch all on separate GPUs (4x 4090 example)
./scripts/train_multi.sh configs/model1.yaml configs/model2.yaml configs/model3.yaml configs/model4.yaml
# Monitor all training runs in TensorBoard
# http://localhost:6006

Train on top of an existing trained model:
# In app/train.py, change the model load to:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen3-4B-Instruct-2507",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base_model, "./lora_model")

For advanced use cases, try the experimental inference scripts:
# LangChain tool calling
python app/experimental/inference_with_tools.py
# JSON API interface
python app/experimental/inference_json_api.py
# Dynamic tool naming (TypeScript frameworks)
python app/experimental/inference_dynamic_tools.py
# Ollama tool server (subprocess callable)
python app/experimental/ollama_tool_server.py <tool_name> [query_file]

- Use Q4_K_M for inference - 2.4GB model with good quality (60% smaller than Q8_0)
- Filter your dataset - Run python app/utils/filter_dataset.py before training
- Analyze token distribution - Use python app/utils/analyze_tokens.py to optimize settings
- Adjust LoRA rank - Lower rank (8) trains faster, higher rank (32) is more flexible
- Use gradient accumulation - Simulates a larger batch size without more GPU memory
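For example, with the default config's batch_size of 1 and gradient_accumulation_steps of 2, the optimizer sees an effective batch of 2 while peak memory is driven only by the per-device batch size:

```python
# Effective batch size is the product of per-device batch size,
# accumulation steps, and (on multi-GPU setups) the number of devices.
per_device_batch_size = 1
gradient_accumulation_steps = 2
num_devices = 1

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)  # 2
```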
Minimum:
- GPU with 12GB VRAM (RTX 3060, RTX 4060Ti, A40)
- 50GB storage (models + datasets)
- Python 3.10+
Recommended:
- GPU with 16GB+ VRAM (RTX 4070 or better)
- 100GB storage (for multiple quantizations)
- 32GB+ system RAM
- Prepare data: JSONL format with conversations
- Filter data: python app/utils/filter_dataset.py
- Analyze tokens: python app/utils/analyze_tokens.py
- Train: ./scripts/train_and_quantize.sh (full pipeline)
- Infer: ollama run qwen3-css-ranker:4b-instruct
This project uses:
- Unsloth: LGPL-3.0 license
- Qwen3 Model: Apache 2.0 license
- Project code: MIT (or your chosen license)
If you encounter issues:
- Check the Troubleshooting section above
- Review the plan file: /path/to/.claude/plans/parsed-kindling-church.md
- Check logs from training for specific errors
- Ensure your dataset is in the correct format
The restructured project is designed to be simple:
- Setup (one time):

  python -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt

- Prepare data:
  - Place JSONL files in datasets/
  - Run: python app/utils/filter_dataset.py

- Configure (edit YAML, not Python):
  - Copy: cp configs/example.yaml configs/my-model.yaml
  - Edit configs/my-model.yaml with your dataset paths and settings

- Train:

  # With default config
  ./scripts/train_and_quantize.sh
  # Or with custom config
  ./scripts/train_and_quantize.sh configs/my-model.yaml

- Use trained model:

  ollama run qwen3-css-ranker:4b-instruct
  # Type your queries, model responds
All Python complexity is hidden. You only need to:
- Edit YAML config files (not Python code)
- Run shell scripts
- Check outputs in the directories specified in your config