This project uses fine-tuning to improve the performance of a small LLM (0.6B parameters) on a challenging task: producing structured output. Specifically, the model is trained to extract each entry's title, start page, and end page from a book's table of contents and return them in JSON format. The task is particularly difficult for small models because table-of-contents layouts vary widely from book to book.
This repository provides a complete pipeline for fine-tuning language models to extract structured tables of contents (TOC) from noisy, real-world documents. Using parameter-efficient LoRA fine-tuning, the project transforms imperfect TOC text into clean, structured JSON format.
Fine-tunes Qwen3-0.6B to robustly parse messy tables of contents and extract:
- Chapter numbers and titles
- Start and end page ranges
- Clean JSON structure from noisy inputs
The model learns to handle OCR artifacts, formatting inconsistencies, random characters, and missing information commonly found in real document TOCs.
Before Fine-tuning: Base model struggles with page ranges and noisy inputs
{"chapter_number": 1, "start_page": 25, "end_page": 25} // ❌ Missing page range
After Fine-tuning: Properly calculates page ranges and handles noise
{"chapter_number": "1", "start_page": 25, "end_page": 67} // ✅ Correct end_page calculation
- Converts LLM-distilled .docx files to structured JSON
- Processes chapter-based content with consistent formatting
- Creates 15,000 training examples with configurable noise parameters
- Applies realistic document extraction errors (OCR artifacts, formatting issues)
- Generates paired examples: noisy TOC → clean JSON
- Uses LoRA (Low-Rank Adaptation) for parameter-efficient training
- Optimized for Google Colab with 4-bit quantization
- Exports models in multiple formats:
  - 16-bit merged model for maximum quality
  - 4-bit GGUF for efficient deployment
- Side-by-side comparison of base vs fine-tuned models
- Demonstrates improved page range calculation and noise handling
- Google Colab account with GPU runtime
- Hugging Face account (for model access)
Pre-generated synthetic training data is available in the data/ folder. You can directly proceed to fine-tuning.
Run generate_synthetic_data.ipynb locally to create custom training data with your own noise parameters.
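As a rough illustration of what a configurable noise parameter could look like, the sketch below builds one noisy TOC → clean JSON pair. It is not the notebook's actual code: `add_noise`, `make_example`, `JUNK_CHARS`, the `char_noise` parameter, the `title` key, and the sample chapters are all hypothetical names and values chosen for this example.

```python
import json
import random

JUNK_CHARS = "~|^*#"  # hypothetical stand-ins for OCR artifacts


def add_noise(text, rng, char_noise=0.05):
    """Replace some letters and punctuation with junk characters.

    Spaces and digits are left alone so page numbers stay recoverable.
    """
    out = []
    for ch in text:
        corruptible = ch != " " and not ch.isdigit()
        if corruptible and rng.random() < char_noise:
            out.append(rng.choice(JUNK_CHARS))
        else:
            out.append(ch)
    return "".join(out)


def make_example(chapters, rng, char_noise=0.05):
    """Return one (noisy TOC text, clean JSON string) training pair."""
    lines = [
        f"{c['chapter_number']}. {c['title']} ..... {c['start_page']}"
        for c in chapters
    ]
    noisy_toc = "\n".join(add_noise(line, rng, char_noise) for line in lines)
    return noisy_toc, json.dumps(chapters)


# Made-up sample data for illustration only
chapters = [
    {"chapter_number": "1", "title": "Introduction", "start_page": 25, "end_page": 67},
    {"chapter_number": "2", "title": "Methods", "start_page": 68, "end_page": 120},
]
noisy_toc, target = make_example(chapters, random.Random(0), char_noise=0.2)
print(noisy_toc)
print(target)
```

Passing a seeded `random.Random` keeps each generated pair reproducible, and raising `char_noise` yields harder inputs while the JSON target stays unchanged.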
Important: The fine-tuning and benchmarking notebooks (finetuning.ipynb and models_benchmarking.ipynb) are optimized for Google Colab and should be run there.
A fine-tuned 16-bit merged version is available on Hugging Face: 🤗 Fine-tuned TOC Extractor Model
├── notebooks/
│ ├── load_distilled_data.ipynb # DOCX → JSON processing
│ ├── generate_synthetic_data.ipynb # Synthetic training data
│ ├── finetuning.ipynb # LoRA fine-tuning (Colab optimized)
│ └── models_benchmarking.ipynb # Performance comparison (Colab optimized)
├── src/ # Supporting Python modules
├── data/ # Pre-generated training data
├── synthetic_data/ # Generated data and models
└── README.md
- Document Processing Pipelines: Extract chapter structure from academic papers, technical manuals
- Digital Library Systems: Automatically structure document collections
- Publishing Workflows: Process manuscripts with inconsistent formatting
- Research Applications: Handle noisy OCR outputs from scanned documents
- Base Model: Qwen3-0.6B (4-bit quantization)
- Training Method: LoRA adapters (r=16, α=32)
- Dataset: 15,000 synthetic examples with configurable noise
- Training Time: ~1-2 hours on Colab T4 GPU
- Memory Efficient: 4-bit quantization + gradient checkpointing
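To give a feel for why LoRA with r=16 and α=32 is parameter-efficient, here is a small back-of-the-envelope calculation. It assumes the standard LoRA formulation, where a weight update is `(alpha / r) * B @ A`, and uses an illustrative 1024 × 1024 layer; the layer size and the helper `lora_param_count` are assumptions for this sketch, not values taken from the training notebook.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # LoRA learns A (r x d_in) and B (d_out x r) instead of the full matrix.
    return r * (d_in + d_out)


r, alpha = 16, 32  # values listed above
d = 1024           # illustrative square layer size (assumption)

full = d * d
added = lora_param_count(d, d, r)
scaling = alpha / r  # factor applied to B @ A in standard LoRA

print(f"full matrix:  {full:,} params")
print(f"LoRA adapter: {added:,} params ({added / full:.1%} of full)")
print(f"update scaling alpha/r: {scaling}")
```

Only the adapter matrices receive gradients, which is what lets the frozen base weights stay in 4-bit form during training.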
The fine-tuned model demonstrates significant improvements:
- ✅ Accurate page range calculation (end_page = next_start_page - 1)
- ✅ Noise resistance to formatting inconsistencies and OCR errors
- ✅ Consistent JSON output with required fields
- ✅ Content filtering ignores exercises, decorative elements, standalone numbers
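The page range rule above (end_page = next_start_page - 1) can be sketched as a short reference function, which is handy for checking model output. This is a minimal sketch under assumptions: `fill_end_pages` and `last_page` are hypothetical names, the second chapter is made-up sample data, and the rule itself says nothing about the final chapter, so the book's last page is passed in explicitly.

```python
def fill_end_pages(chapters, last_page):
    """Set each chapter's end_page to the next chapter's start_page - 1."""
    ordered = sorted(chapters, key=lambda c: c["start_page"])
    result = []
    for i, chapter in enumerate(ordered):
        if i + 1 < len(ordered):
            end_page = ordered[i + 1]["start_page"] - 1
        else:
            end_page = last_page  # no next chapter: fall back to the supplied last page
        result.append({**chapter, "end_page": end_page})
    return result


toc = [
    {"chapter_number": "1", "start_page": 25},
    {"chapter_number": "2", "start_page": 68},  # made-up second chapter
]
for entry in fill_end_pages(toc, last_page=120):
    print(entry)
```

With these inputs, chapter 1 runs from page 25 to page 67, matching the fine-tuned example shown earlier in this README.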