Skip to content

Latest commit

 

History

History
434 lines (340 loc) · 14.7 KB

File metadata and controls

434 lines (340 loc) · 14.7 KB

PDF Spec Compliance Cleanup Roadmap

Goal: Remove all non-PDF-spec-compliant code from core extraction. Reorganize into:

  • Core: Spec-compliant extraction only
  • Enhancements: Optional, user-controlled features
  • Clear boundary: No mixed concerns

Current State: ~2,000+ lines of non-spec code across 12 areas (Phase 10 status: 4.4/10) Target State: Spec-strict core + optional enhancement layers (8.5+/10)


PHASE 1: Removal (Lowest Priority - Just Delete)

These features have NO DEPENDENCY from core extraction. Safe to remove entirely.

1.1 Table Detection Module

Location: src/layout/table_detector.rs (300+ lines) Status: Can be removed immediately Action:

  • Delete src/layout/table_detector.rs
  • Remove from src/layout/mod.rs exports
  • Remove from src/extractors/mod.rs if exported
  • Update any tests that reference table detection
  • Remove TableDetector, DetectedTable, TableDetectorConfig from public API

Why: Tables are semantic concepts NOT in PDF spec. Users should use structure tree (Section 14.7) if they want real table information.

Impact: ✅ No code breakage (optional feature) Effort: 30 minutes


1.2 Heading Detection Heuristics

Location: src/layout/heading_detector.rs (300+ lines) Status: Can be removed if not core to extraction Action:

  • Check if heading_detector is used in critical path
  • If yes: Keep but add spec_compliant: false flag
  • If no: Delete entire module
  • Remove hardcoded font size thresholds (22pt, 18pt, etc.)

Why: Font-based heading detection is linguistic interpretation, not PDF spec feature. Use structure tree for real heading info.

Impact: Depends on usage - potentially breaks heading detection Effort: 1 hour if deletable, 2 hours if keeping with annotations


1.3 ML Heading Classifier

Location: src/ml/heading_classifier.rs (200+ lines) Status: Remove or move to optional module Action:

  • Delete src/ml/heading_classifier.rs
  • Remove from src/ml/mod.rs
  • If ML module becomes empty, consider removing entire src/ml/
  • Remove all DistilBERT references from docs

Why: ML-based semantic analysis is antithetical to spec compliance. It's a proprietary classification layer.

Impact: ✅ No code breakage Effort: 30 minutes


PHASE 2: Migration to Optional (Medium Priority)

These are NEEDED for quality but NOT in PDF spec. Move to optional post-processing layer.

2.1 CamelCase Word Splitting

Location: src/extractors/text.rs:1467-1475, 2057-2141, 3671-3989 Current State: Already disabled but code still present Action:

  • Create new module: src/post_processors/word_splitter.rs
  • Move split_fused_words(), split_on_camelcase() to new module
  • Remove calls from main extraction pipeline
  • Add post-processing layer to text extraction with opt-in flag
  • Document: "Optional feature - not PDF spec-based"
  • Add unit tests: verify "theGeneral" → "the General" works
  • Remove dead code from text.rs: lines 3671-3989

Configuration:

pub struct TextExtractionConfig {
    // ... existing spec-based config ...

    // Optional enhancements (NOT spec-based)
    pub enable_word_splitting: bool,  // Default: false
}

Impact: ✅ +3 word fusions fixed when enabled Effort: 2-3 hours


2.2 Document Type Detection & Profiles

Location: src/extractors/gap_statistics.rs:154-248 (configuration), various .policy_documents(), .academic() methods Current State: Active, controlling 1,623 spurious spaces Challenge: Removing this LOWERS quality. Need to decide:

  • Option A: Delete (pure spec-only, quality drops to 3.5/10)
  • Option B: Keep but annotate as "empirical heuristic" (current approach)
  • Option C: Move to optional module with better documentation

Recommendation: Option B (Keep with Annotations) - for now

  • Add comments to all doc-type profiles explaining they're non-spec
  • Create config flag: use_adaptive_thresholds: bool (default: true)
  • Document why: "Empirical tuning for real-world PDFs"
  • Create variant: spec_strict_config() that disables all adaptive features
  • Later: Can move to optional module after implementing better spec-based solution

Example Documentation:

/// **NON-SPEC HEURISTIC**: Document-type-specific thresholds
/// These multipliers (1.3x for policy, 1.6x for academic) are empirically chosen
/// and NOT derived from ISO 32000-1:2008. They improve practical quality but
/// reduce spec compliance. Disable via: config.use_adaptive_thresholds = false
pub fn policy_documents() -> Self {
    Self {
        median_multiplier: 1.3,  // Tight spacing in policy docs
        // ...
    }
}

Impact: Maintains current quality until we find better approach Effort: 1-2 hours (annotation only)


2.3 Column Detection & Layout Analysis

Location: src/layout/document_analyzer.rs:118-408 (bin sizes, gap ratios, Gaussian sigma) Current State: Active, used for adaptive layout Decision: KEEP but separate into "layout enhancement" module Action:

  • Move to new module: src/enhancements/layout_analysis.rs
  • Mark all magic numbers with sources (ICDAR paper reference)
  • Add config flag: enable_layout_analysis: bool (default: true)
  • Document: "Uses ICDAR 2005 layout algorithm, not PDF spec-based"
  • Keep in extractors but with clear separation

Configuration:

pub struct LayoutAnalysisConfig {
    pub enabled: bool,
    // Bin width for projection profile (ICDAR algorithm)
    pub histogram_bin_width_pt: f32,  // default: 10.0
    // ... other ICDAR parameters ...
}

Impact: Maintains layout analysis, improves documentation Effort: 2-3 hours


PHASE 3: Annotation (High Priority - Quick Wins)

Add clear documentation to all non-spec code that stays.

3.1 Tag All Non-Spec Code

Action:

  • Find all non-spec implementations (use analysis output)
  • Add comment block:
/// **NON-SPEC HEURISTIC**
///
/// This feature is NOT defined in ISO 32000-1:2008.
///
/// Reason: [why we do this despite not being in spec]
/// Source: [paper/empirical/pdf-specific]
/// Status: [enabled by default | optional | deprecated]
///
/// To disable: [config flag or how]
/// Impact on quality: [what happens if disabled]

Locations to annotate:

  • gap_statistics.rs: All multiplier-based thresholds
  • geometric_spacing.rs: Document the 0.25em ratio choice
  • document_analyzer.rs: All ICDAR algorithm parameters
  • column_detector.rs: XY-Cut algorithm parameters
  • bold_validation.rs: Unicode whitespace handling

Effort: 3-4 hours


3.2 Create Spec Compliance Reference

New file: docs/SPEC_COMPLIANCE_GUIDE.md Content:

  • List all PDF spec sections used (9.3, 9.4.3, 9.4.4, etc.)
  • List all non-spec features and justifications
  • Configuration guide: How to enable/disable features
  • Quality vs. Compliance trade-offs
  • Comparison with pdfplumber, pdfminer.six

Effort: 2-3 hours


PHASE 4: Create Optional Enhancement Layers

4.1 Post-Processor Framework

New file: src/post_processors/mod.rs Purpose: Apply non-spec fixes AFTER spec-compliant extraction

pub trait PostProcessor {
    fn process(&self, document: &mut ExtractedDocument) -> Result<()>;
}

pub struct TextRepairProcessor {
    pub split_camelcase: bool,
    pub fix_empty_markers: bool,
    // ...
}

pub fn apply_post_processors(
    document: &mut ExtractedDocument,
    config: &PostProcessorConfig,
) -> Result<()> {
    if config.word_splitting.enabled {
        TextRepairProcessor::split_fused_words(document)?;
    }
    if config.bold_validation.enabled {
        BoldMarkerValidator::fix_empty_markers(document)?;
    }
    // ...
}

Effort: 3-4 hours


PHASE 5: Create Spec-Strict Mode

New configuration: TextExtractionConfig::spec_strict()

impl TextExtractionConfig {
    /// Returns configuration that ONLY uses PDF spec features
    /// - TJ array offsets (Section 9.4.3)
    /// - Boundary whitespace (Section 9.4.3)
    /// - Geometric gaps with fixed 0.25em threshold (Section 9.4.4)
    /// - Font metrics (Section 9.3)
    pub fn spec_strict() -> Self {
        Self {
            // Core spec features
            use_tj_offsets: true,
            use_geometric_gaps: true,
            use_boundary_whitespace: true,

            // Disable ALL non-spec features
            use_adaptive_thresholds: false,
            enable_word_splitting: false,
            enable_layout_analysis: false,
            enable_table_detection: false,
            enable_heading_detection: false,

            // Fixed thresholds (from pdfplumber)
            geometric_gap_threshold_em: 0.25,  // Standard 0.25em
            ..Default::default()
        }
    }
}

Testing:

  • Add test: test_spec_strict_mode_disabled()
  • Run regression suite with spec_strict()
  • Expected: Lower quality (3.5-4.5/10) but spec-compliant

Effort: 1-2 hours


Execution Order (Recommended)

Stage 1: Quick Removals + Annotations

  1. Phase 1.1: Delete table_detector.rs (30 min)
  2. Phase 1.2: Delete heading_detector.rs or annotate (1-2 hrs)
  3. Phase 1.3: Delete ML classifier (30 min)
  4. Phase 3.1: Annotate all non-spec code (3-4 hrs)
  5. Total: ~6-8 hours → Immediate clarity on what's non-spec

Stage 2: Documentation + Refactoring

  1. Phase 3.2: Create spec compliance guide (2-3 hrs)
  2. Phase 2.1: Move CamelCase to post-processor (2-3 hrs)
  3. Phase 2.3: Move layout analysis to enhancement module (2-3 hrs)
  4. Total: ~6-9 hours → Clean separation of concerns

Stage 3: Framework + Testing

  1. Phase 4.1: Create post-processor framework (3-4 hrs)
  2. Phase 5: Create spec-strict mode (1-2 hrs)
  3. Testing: Regression suite + quality metrics (2-3 hrs)
  4. Total: ~6-9 hours → Production-ready clean architecture

File Structure After Cleanup

src/
├── core/                          # SPEC-COMPLIANT ONLY
│   ├── text_extraction.rs         # Core text extraction (TJ, boundaries, gaps)
│   ├── geometric_spacing.rs       # Fixed 0.25em threshold (CURRENT geometric_spacing.rs)
│   └── font_metrics.rs            # Font state parameters (Tc, Tw, Th)
│
├── enhancements/                  # OPTIONAL, USER-CONTROLLED
│   ├── adaptive_thresholds.rs     # Gap statistics multipliers (from gap_statistics.rs)
│   ├── layout_analysis.rs         # Document analysis, column detection (ICDAR-based)
│   └── config.rs                  # Unified enhancement configuration
│
├── post_processors/               # APPLIED AFTER EXTRACTION (NON-SPEC)
│   ├── mod.rs                     # PostProcessor trait
│   ├── word_splitter.rs           # CamelCase splitting (from split_fused_words)
│   ├── bold_validator.rs          # Empty bold marker fixes (moved from converters)
│   └── spurious_space_fixer.rs    # Fix double spaces (Issue #2)
│
├── converters/
│   └── markdown.rs                # Markdown output (use post-processors)
│
└── [other modules unchanged]

docs/
├── PHASE10_PDF_SPEC_COMPLIANCE.md          # Existing
├── CLEANUP_ROADMAP.md                      # This file
└── SPEC_COMPLIANCE_GUIDE.md                # New - Comprehensive guide

Quality & Compliance Matrix

Config Mode Word Fusions Spurious Spaces Empty Bold Quality Spec Compliant
spec_strict ❌ 3 ✅ 0 ❌ 2-3 3.5/10 ✅ 100%
default ❌ 3 ✅ 0 ❌ 2-3 4.4/10 🟡 70%
with_enhancements ❌ 3 ✅ 0 ❌ 2-3 6.5/10 🟡 50%
with_all_fixes ✅ 0 ✅ 0 ✅ 0 8.5/10 🟡 40%

Critical Notes

What to KEEP (with justification)

  1. Geometric spacing 0.25em threshold

    • Justified: pdfplumber standard, widely proven
    • Spec: Section 9.4.4 supports this interpretation
    • Config: Fixed (not adaptive)
  2. Boundary whitespace detection

    • Justified: Directly in PDF spec (Section 9.4.3)
    • Spec: "Spaces in text strings"
    • Config: No option (always on)
  3. TJ offset signals

    • Justified: Directly in PDF spec (Section 9.4.3)
    • Spec: "TJ array offsets determine positioning"
    • Config: No option (always on)
  4. Bold/italic detection from font flags

    • Justified: Font properties in PDF spec (Section 5.3.3)
    • Spec: Font.Flags, Font.FontWeight etc.
    • Config: Always on (core feature)

What to REMOVE (no spec justification)

  1. ❌ Table detection (move to optional)
  2. ❌ Heading detection heuristics (move to optional)
  3. ❌ ML classifiers (delete)
  4. ❌ CamelCase splitting (move to post-processor)
  5. ❌ Document-type profiles (annotate as heuristic)

What to ANNOTATE (keep but document)

  1. 📝 Adaptive gap multipliers (empirical, non-spec)
  2. 📝 ICDAR layout analysis (academic, non-spec)
  3. 📝 Unicode whitespace handling (PDF-specific workaround)

Success Criteria

  • All non-spec code clearly marked with NON-SPEC HEURISTIC comments
  • New modules: core/, enhancements/, post_processors/
  • spec_strict() configuration works (3.5/10 quality, 100% compliant)
  • Default configuration improved (4.4→5.0/10, ~70% compliant)
  • All fixes as optional post-processors (8.5/10, ~40% compliant but user-controlled)
  • Comprehensive spec compliance guide published
  • Regression suite passes for all configurations
  • Clear user documentation: When to enable/disable features

Commands for Testing

# Test spec-strict mode
cargo test --test quality_metrics -- --spec-strict

# Test with all enhancements
cargo test --test quality_metrics -- --enable-all

# Test post-processors
cargo test --test quality_metrics -- --with-post-processors

# Full regression suite
cargo test --test regression_suite

Questions Before Starting

  1. Question 1: Should we delete table detection entirely, or keep it but move to optional module?

    • Recommended: Delete (false positives, users have structure tree)
  2. Question 2: For adaptive gaps, should we move to enhancements/ or keep in core?

    • Recommended: Keep in core but annotate heavily (needed for current quality)
  3. Question 3: Should spec_strict_mode() be the default or opt-in?

    • Recommended: Opt-in (users expect good quality by default)

Timeline

  • Phase 1-2: 8 hours → Remove/migrate non-spec code
  • Phase 3-4: 8 hours → Create framework + documentation
  • Phase 5: 3 hours → Testing + verification
  • Total: ~19 hours → Production-ready clean architecture

Should we start with Phase 1 (quick removals)?