PDF Spec Compliance Cleanup Roadmap

Goal: Remove all non-PDF-spec-compliant code from core extraction. Reorganize into:

Core: Spec-compliant extraction only
Enhancements: Optional, user-controlled features
Clear boundary: No mixed concerns

Current State: ~2,000+ lines of non-spec code across 12 areas (Phase 10 status: 4.4/10) Target State: Spec-strict core + optional enhancement layers (8.5+/10)

PHASE 1: Removal (Lowest Priority - Just Delete)

These features have NO DEPENDENCY from core extraction. Safe to remove entirely.

1.1 Table Detection Module

Location: src/layout/table_detector.rs (300+ lines) Status: Can be removed immediately Action:

Delete src/layout/table_detector.rs
Remove from src/layout/mod.rs exports
Remove from src/extractors/mod.rs if exported
Update any tests that reference table detection
Remove TableDetector, DetectedTable, TableDetectorConfig from public API

Why: Tables are semantic concepts NOT in PDF spec. Users should use structure tree (Section 14.7) if they want real table information.

Impact: ✅ No code breakage (optional feature) Effort: 30 minutes

1.2 Heading Detection Heuristics

Location: src/layout/heading_detector.rs (300+ lines) Status: Can be removed if not core to extraction Action:

Check if heading_detector is used in critical path
If yes: Keep but add spec_compliant: false flag
If no: Delete entire module
Remove hardcoded font size thresholds (22pt, 18pt, etc.)

Why: Font-based heading detection is linguistic interpretation, not PDF spec feature. Use structure tree for real heading info.

Impact: Depends on usage - potentially breaks heading detection Effort: 1 hour if deletable, 2 hours if keeping with annotations

1.3 ML Heading Classifier

Location: src/ml/heading_classifier.rs (200+ lines) Status: Remove or move to optional module Action:

Delete src/ml/heading_classifier.rs
Remove from src/ml/mod.rs
If ML module becomes empty, consider removing entire src/ml/
Remove all DistilBERT references from docs

Why: ML-based semantic analysis is antithetical to spec compliance. It's a proprietary classification layer.

Impact: ✅ No code breakage Effort: 30 minutes

PHASE 2: Migration to Optional (Medium Priority)

These are NEEDED for quality but NOT in PDF spec. Move to optional post-processing layer.

2.1 CamelCase Word Splitting

Location: src/extractors/text.rs:1467-1475, 2057-2141, 3671-3989 Current State: Already disabled but code still present Action:

Create new module: src/post_processors/word_splitter.rs
Move split_fused_words(), split_on_camelcase() to new module
Remove calls from main extraction pipeline
Add post-processing layer to text extraction with opt-in flag
Document: "Optional feature - not PDF spec-based"
Add unit tests: verify "theGeneral" → "the General" works
Remove dead code from text.rs: lines 3671-3989

Configuration:

pub struct TextExtractionConfig {
    // ... existing spec-based config ...

    // Optional enhancements (NOT spec-based)
    pub enable_word_splitting: bool,  // Default: false
}

Impact: ✅ +3 word fusions fixed when enabled Effort: 2-3 hours

2.2 Document Type Detection & Profiles

Location: src/extractors/gap_statistics.rs:154-248 (configuration), various .policy_documents(), .academic() methods Current State: Active, controlling 1,623 spurious spaces Challenge: Removing this LOWERS quality. Need to decide:

Option A: Delete (pure spec-only, quality drops to 3.5/10)
Option B: Keep but annotate as "empirical heuristic" (current approach)
Option C: Move to optional module with better documentation

Recommendation: Option B (Keep with Annotations) - for now

Add comments to all doc-type profiles explaining they're non-spec
Create config flag: use_adaptive_thresholds: bool (default: true)
Document why: "Empirical tuning for real-world PDFs"
Create variant: spec_strict_config() that disables all adaptive features
Later: Can move to optional module after implementing better spec-based solution

Example Documentation:

/// **NON-SPEC HEURISTIC**: Document-type-specific thresholds
/// These multipliers (1.3x for policy, 1.6x for academic) are empirically chosen
/// and NOT derived from ISO 32000-1:2008. They improve practical quality but
/// reduce spec compliance. Disable via: config.use_adaptive_thresholds = false
pub fn policy_documents() -> Self {
    Self {
        median_multiplier: 1.3,  // Tight spacing in policy docs
        // ...
    }
}

Impact: Maintains current quality until we find better approach Effort: 1-2 hours (annotation only)

2.3 Column Detection & Layout Analysis

Location: src/layout/document_analyzer.rs:118-408 (bin sizes, gap ratios, Gaussian sigma) Current State: Active, used for adaptive layout Decision: KEEP but separate into "layout enhancement" module Action:

Move to new module: src/enhancements/layout_analysis.rs
Mark all magic numbers with sources (ICDAR paper reference)
Add config flag: enable_layout_analysis: bool (default: true)
Document: "Uses ICDAR 2005 layout algorithm, not PDF spec-based"
Keep in extractors but with clear separation

Configuration:

pub struct LayoutAnalysisConfig {
    pub enabled: bool,
    // Bin width for projection profile (ICDAR algorithm)
    pub histogram_bin_width_pt: f32,  // default: 10.0
    // ... other ICDAR parameters ...
}

Impact: Maintains layout analysis, improves documentation Effort: 2-3 hours

PHASE 3: Annotation (High Priority - Quick Wins)

Add clear documentation to all non-spec code that stays.

3.1 Tag All Non-Spec Code

Action:

Find all non-spec implementations (use analysis output)
Add comment block:

/// **NON-SPEC HEURISTIC**
///
/// This feature is NOT defined in ISO 32000-1:2008.
///
/// Reason: [why we do this despite not being in spec]
/// Source: [paper/empirical/pdf-specific]
/// Status: [enabled by default | optional | deprecated]
///
/// To disable: [config flag or how]
/// Impact on quality: [what happens if disabled]

Locations to annotate:

gap_statistics.rs: All multiplier-based thresholds
geometric_spacing.rs: Document the 0.25em ratio choice
document_analyzer.rs: All ICDAR algorithm parameters
column_detector.rs: XY-Cut algorithm parameters
bold_validation.rs: Unicode whitespace handling

Effort: 3-4 hours

3.2 Create Spec Compliance Reference

New file: docs/SPEC_COMPLIANCE_GUIDE.md Content:

List all PDF spec sections used (9.3, 9.4.3, 9.4.4, etc.)
List all non-spec features and justifications
Configuration guide: How to enable/disable features
Quality vs. Compliance trade-offs
Comparison with pdfplumber, pdfminer.six

Effort: 2-3 hours

PHASE 4: Create Optional Enhancement Layers

4.1 Post-Processor Framework

New file: src/post_processors/mod.rs Purpose: Apply non-spec fixes AFTER spec-compliant extraction

pub trait PostProcessor {
    fn process(&self, document: &mut ExtractedDocument) -> Result<()>;
}

pub struct TextRepairProcessor {
    pub split_camelcase: bool,
    pub fix_empty_markers: bool,
    // ...
}

pub fn apply_post_processors(
    document: &mut ExtractedDocument,
    config: &PostProcessorConfig,
) -> Result<()> {
    if config.word_splitting.enabled {
        TextRepairProcessor::split_fused_words(document)?;
    }
    if config.bold_validation.enabled {
        BoldMarkerValidator::fix_empty_markers(document)?;
    }
    // ...
}

Effort: 3-4 hours

PHASE 5: Create Spec-Strict Mode

New configuration: TextExtractionConfig::spec_strict()

impl TextExtractionConfig {
    /// Returns configuration that ONLY uses PDF spec features
    /// - TJ array offsets (Section 9.4.3)
    /// - Boundary whitespace (Section 9.4.3)
    /// - Geometric gaps with fixed 0.25em threshold (Section 9.4.4)
    /// - Font metrics (Section 9.3)
    pub fn spec_strict() -> Self {
        Self {
            // Core spec features
            use_tj_offsets: true,
            use_geometric_gaps: true,
            use_boundary_whitespace: true,

            // Disable ALL non-spec features
            use_adaptive_thresholds: false,
            enable_word_splitting: false,
            enable_layout_analysis: false,
            enable_table_detection: false,
            enable_heading_detection: false,

            // Fixed thresholds (from pdfplumber)
            geometric_gap_threshold_em: 0.25,  // Standard 0.25em
            ..Default::default()
        }
    }
}

Testing:

Add test: test_spec_strict_mode_disabled()
Run regression suite with spec_strict()
Expected: Lower quality (3.5-4.5/10) but spec-compliant

Effort: 1-2 hours

Execution Order (Recommended)

Stage 1: Quick Removals + Annotations

Phase 1.1: Delete table_detector.rs (30 min)
Phase 1.2: Delete heading_detector.rs or annotate (1-2 hrs)
Phase 1.3: Delete ML classifier (30 min)
Phase 3.1: Annotate all non-spec code (3-4 hrs)
Total: ~6-8 hours → Immediate clarity on what's non-spec

Stage 2: Documentation + Refactoring

Phase 3.2: Create spec compliance guide (2-3 hrs)
Phase 2.1: Move CamelCase to post-processor (2-3 hrs)
Phase 2.3: Move layout analysis to enhancement module (2-3 hrs)
Total: ~6-9 hours → Clean separation of concerns

Stage 3: Framework + Testing

Phase 4.1: Create post-processor framework (3-4 hrs)
Phase 5: Create spec-strict mode (1-2 hrs)
Testing: Regression suite + quality metrics (2-3 hrs)
Total: ~6-9 hours → Production-ready clean architecture

File Structure After Cleanup

src/
├── core/                          # SPEC-COMPLIANT ONLY
│   ├── text_extraction.rs         # Core text extraction (TJ, boundaries, gaps)
│   ├── geometric_spacing.rs       # Fixed 0.25em threshold (CURRENT geometric_spacing.rs)
│   └── font_metrics.rs            # Font state parameters (Tc, Tw, Th)
│
├── enhancements/                  # OPTIONAL, USER-CONTROLLED
│   ├── adaptive_thresholds.rs     # Gap statistics multipliers (from gap_statistics.rs)
│   ├── layout_analysis.rs         # Document analysis, column detection (ICDAR-based)
│   └── config.rs                  # Unified enhancement configuration
│
├── post_processors/               # APPLIED AFTER EXTRACTION (NON-SPEC)
│   ├── mod.rs                     # PostProcessor trait
│   ├── word_splitter.rs           # CamelCase splitting (from split_fused_words)
│   ├── bold_validator.rs          # Empty bold marker fixes (moved from converters)
│   └── spurious_space_fixer.rs    # Fix double spaces (Issue #2)
│
├── converters/
│   └── markdown.rs                # Markdown output (use post-processors)
│
└── [other modules unchanged]

docs/
├── PHASE10_PDF_SPEC_COMPLIANCE.md          # Existing
├── CLEANUP_ROADMAP.md                      # This file
└── SPEC_COMPLIANCE_GUIDE.md                # New - Comprehensive guide

Quality & Compliance Matrix

Config Mode	Word Fusions	Spurious Spaces	Empty Bold	Quality	Spec Compliant
spec_strict	❌ 3	✅ 0	❌ 2-3	3.5/10	✅ 100%
default	❌ 3	✅ 0	❌ 2-3	4.4/10	🟡 70%
with_enhancements	❌ 3	✅ 0	❌ 2-3	6.5/10	🟡 50%
with_all_fixes	✅ 0	✅ 0	✅ 0	8.5/10	🟡 40%

Critical Notes

What to KEEP (with justification)

Geometric spacing 0.25em threshold ✅
- Justified: pdfplumber standard, widely proven
- Spec: Section 9.4.4 supports this interpretation
- Config: Fixed (not adaptive)
Boundary whitespace detection ✅
- Justified: Directly in PDF spec (Section 9.4.3)
- Spec: "Spaces in text strings"
- Config: No option (always on)
TJ offset signals ✅
- Justified: Directly in PDF spec (Section 9.4.3)
- Spec: "TJ array offsets determine positioning"
- Config: No option (always on)
Bold/italic detection from font flags ✅
- Justified: Font properties in PDF spec (Section 5.3.3)
- Spec: Font.Flags, Font.FontWeight etc.
- Config: Always on (core feature)

What to REMOVE (no spec justification)

❌ Table detection (move to optional)
❌ Heading detection heuristics (move to optional)
❌ ML classifiers (delete)
❌ CamelCase splitting (move to post-processor)
❌ Document-type profiles (annotate as heuristic)

What to ANNOTATE (keep but document)

📝 Adaptive gap multipliers (empirical, non-spec)
📝 ICDAR layout analysis (academic, non-spec)
📝 Unicode whitespace handling (PDF-specific workaround)

Success Criteria

All non-spec code clearly marked with NON-SPEC HEURISTIC comments
New modules: core/, enhancements/, post_processors/
spec_strict() configuration works (3.5/10 quality, 100% compliant)
Default configuration improved (4.4→5.0/10, ~70% compliant)
All fixes as optional post-processors (8.5/10, ~40% compliant but user-controlled)
Comprehensive spec compliance guide published
Regression suite passes for all configurations
Clear user documentation: When to enable/disable features

Commands for Testing

# Test spec-strict mode
cargo test --test quality_metrics -- --spec-strict

# Test with all enhancements
cargo test --test quality_metrics -- --enable-all

# Test post-processors
cargo test --test quality_metrics -- --with-post-processors

# Full regression suite
cargo test --test regression_suite

Questions Before Starting

Question 1: Should we delete table detection entirely, or keep it but move to optional module?
- Recommended: Delete (false positives, users have structure tree)
Question 2: For adaptive gaps, should we move to enhancements/ or keep in core?
- Recommended: Keep in core but annotate heavily (needed for current quality)
Question 3: Should spec_strict_mode() be the default or opt-in?
- Recommended: Opt-in (users expect good quality by default)

Timeline

Phase 1-2: 8 hours → Remove/migrate non-spec code
Phase 3-4: 8 hours → Create framework + documentation
Phase 5: 3 hours → Testing + verification
Total: ~19 hours → Production-ready clean architecture

Should we start with Phase 1 (quick removals)?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PDF Spec Compliance Cleanup Roadmap

PHASE 1: Removal (Lowest Priority - Just Delete)

1.1 Table Detection Module

1.2 Heading Detection Heuristics

1.3 ML Heading Classifier

PHASE 2: Migration to Optional (Medium Priority)

2.1 CamelCase Word Splitting

2.2 Document Type Detection & Profiles

2.3 Column Detection & Layout Analysis

PHASE 3: Annotation (High Priority - Quick Wins)

3.1 Tag All Non-Spec Code

3.2 Create Spec Compliance Reference

PHASE 4: Create Optional Enhancement Layers

4.1 Post-Processor Framework

PHASE 5: Create Spec-Strict Mode

Execution Order (Recommended)

Stage 1: Quick Removals + Annotations

Stage 2: Documentation + Refactoring

Stage 3: Framework + Testing

File Structure After Cleanup

Quality & Compliance Matrix

Critical Notes

What to KEEP (with justification)

What to REMOVE (no spec justification)

What to ANNOTATE (keep but document)

Success Criteria

Commands for Testing

Questions Before Starting

Timeline

FilesExpand file tree

REFACTORING_PLAN.md

Latest commit

History

REFACTORING_PLAN.md

File metadata and controls

PDF Spec Compliance Cleanup Roadmap

PHASE 1: Removal (Lowest Priority - Just Delete)

1.1 Table Detection Module

1.2 Heading Detection Heuristics

1.3 ML Heading Classifier

PHASE 2: Migration to Optional (Medium Priority)

2.1 CamelCase Word Splitting

2.2 Document Type Detection & Profiles

2.3 Column Detection & Layout Analysis

PHASE 3: Annotation (High Priority - Quick Wins)

3.1 Tag All Non-Spec Code

3.2 Create Spec Compliance Reference

PHASE 4: Create Optional Enhancement Layers

4.1 Post-Processor Framework

PHASE 5: Create Spec-Strict Mode

Execution Order (Recommended)

Stage 1: Quick Removals + Annotations

Stage 2: Documentation + Refactoring

Stage 3: Framework + Testing

File Structure After Cleanup

Quality & Compliance Matrix

Critical Notes

What to KEEP (with justification)

What to REMOVE (no spec justification)

What to ANNOTATE (keep but document)

Success Criteria

Commands for Testing

Questions Before Starting

Timeline