Advanced Topics

Overview

This section covers advanced topics for extending and optimizing docling-graph. These guides are for users who need to:

  • Create custom extraction backends
  • Build custom exporters
  • Add pipeline stages
  • Optimize performance
  • Handle errors gracefully
  • Test templates and pipelines

Topics

🧩 Extensibility

Custom Backends
Create custom extraction backends for specialized models or APIs.

  • Implement backend protocols
  • VLM backend example
  • LLM backend example
  • Integration with pipeline

Custom Exporters
Build custom exporters for specialized output formats.

  • Implement exporter protocol
  • Graph data access
  • Custom format generation
  • Registration and usage
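As a rough illustration of the bullets above, the sketch below defines an exporter interface and one implementation that writes nodes and edges as JSON Lines. The names `GraphExporter` and `JsonLinesExporter`, and the plain-dict shape of nodes and edges, are assumptions made for this example; they are not docling-graph's actual exporter protocol, which is described in the Custom Exporters guide.

```python
import json
from pathlib import Path
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class GraphExporter(Protocol):
    """Hypothetical exporter interface (NOT the docling-graph protocol)."""

    def export(
        self,
        nodes: list[dict[str, Any]],
        edges: list[dict[str, Any]],
        path: Path,
    ) -> Path: ...


class JsonLinesExporter:
    """Writes one JSON object per line: all nodes first, then all edges."""

    def export(self, nodes, edges, path):
        path = Path(path)
        with path.open("w", encoding="utf-8") as fh:
            for node in nodes:
                fh.write(json.dumps({"kind": "node", **node}) + "\n")
            for edge in edges:
                fh.write(json.dumps({"kind": "edge", **edge}) + "\n")
        return path
```

Because the interface is a structural `Protocol`, `JsonLinesExporter` satisfies it without inheriting from it, which keeps the exporter free of framework imports.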

Custom Stages
Add custom stages to the pipeline for specialized processing.

  • Pipeline stage protocol
  • Stage implementation
  • Context management
  • Error handling
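The stage concepts listed above (a stage protocol, shared context, and error handling) can be sketched in a few lines. Everything here is hypothetical: `PipelineContext`, `Stage`, `NormalizeWhitespaceStage`, and `run_stages` are stand-ins invented for illustration and do not mirror docling-graph's real stage interface.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class PipelineContext:
    """Hypothetical shared state handed from one stage to the next."""

    data: dict[str, Any] = field(default_factory=dict)
    errors: list[str] = field(default_factory=list)


class Stage(Protocol):
    """Hypothetical stage interface: a name plus a run method."""

    name: str

    def run(self, ctx: PipelineContext) -> PipelineContext: ...


class NormalizeWhitespaceStage:
    name = "normalize_whitespace"

    def run(self, ctx):
        text = ctx.data.get("markdown")
        if text is None:
            raise KeyError("markdown")
        # Collapse runs of spaces and newlines into single spaces.
        ctx.data["markdown"] = " ".join(text.split())
        return ctx


def run_stages(stages, ctx):
    """Run stages in order; record the failing stage and stop at the first error."""
    for stage in stages:
        try:
            ctx = stage.run(ctx)
        except Exception as exc:
            ctx.errors.append(f"{stage.name}: {exc!r}")
            break
    return ctx
```

Recording the stage name alongside the error makes it clear where a run stopped, at the cost of swallowing the traceback; a real pipeline would probably log it as well.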

📐 Optimization

Performance Tuning
Optimize extraction speed and resource usage.

  • Model selection strategies
  • Batch size optimization
  • Memory management
  • GPU utilization
  • Caching strategies
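One possible caching strategy is to key results on a hash of the input text and the template name, so that re-running the same document skips the expensive extraction call. The sketch below is an assumption-laden illustration: `ExtractionCache` and its method names are hypothetical and are not part of docling-graph.

```python
import hashlib
from typing import Any, Callable


class ExtractionCache:
    """Hypothetical in-memory cache keyed by SHA-256 of template name + input text."""

    def __init__(self) -> None:
        self._store: dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(markdown: str, template_name: str) -> str:
        digest = hashlib.sha256()
        digest.update(template_name.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
        digest.update(markdown.encode("utf-8"))
        return digest.hexdigest()

    def get_or_compute(
        self, markdown: str, template_name: str, compute: Callable[[], Any]
    ) -> Any:
        """Return the cached result, or call compute() once and store it."""
        key = self._key(markdown, template_name)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute()
        self._store[key] = result
        return result
```

The cache is unbounded and lives only for the process lifetime; for long-running jobs an eviction policy or an on-disk store would be a reasonable next step.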

🛡️ Reliability

Error Handling
Handle errors gracefully and implement retry logic.

  • Exception hierarchy
  • Error recovery strategies
  • Logging and debugging
  • Retry mechanisms
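A retry mechanism of the kind listed above is often implemented as exponential backoff. The helper below is a generic sketch, not docling-graph's own retry logic; the name `retry` and its parameters are assumptions for this example. The `sleep` argument is injectable so tests can run without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: tuple[type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func() up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error unchanged
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Narrowing `retry_on` to transient errors (for example `ConnectionError` or `TimeoutError`) avoids retrying failures that will never succeed, such as validation errors.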

Testing
Test templates, backends, and pipelines.

  • Template validation
  • Mock backends
  • Integration testing
  • CI/CD integration
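A mock backend lets tests exercise pipeline code without loading a model or calling an API. The sketch below assumes the `extract_from_markdown` signature shown later on this page; `MockBackend` and `Invoice` are hypothetical, and a plain dataclass stands in for a Pydantic template to keep the example dependency-free.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    """Hypothetical stand-in for a Pydantic template."""

    number: str
    total: float


class MockBackend:
    """Returns a canned result and records every call for later assertions."""

    def __init__(self, canned):
        self.canned = canned
        self.calls = []

    def extract_from_markdown(self, markdown, template, context="", is_partial=False):
        self.calls.append((markdown, template, context, is_partial))
        return template(**self.canned)

    def cleanup(self):
        self.calls.clear()
```

A test can then assert both on the returned object and on `backend.calls`, which verifies that the surrounding code passed the expected arguments.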

Prerequisites

Before diving into advanced topics, ensure you understand:

  1. Schema Definition - Pydantic templates
  2. Pipeline Configuration - Configuration options
  3. Extraction Process - How extraction works
  4. Python API - Programmatic usage

When to Use Advanced Features

Custom Backends

Use when:
✅ You have a specialized model not supported by default
✅ You need to integrate with a proprietary API
✅ You want to implement custom preprocessing
✅ You need fine-grained control over extraction

Don't use when:
❌ Default backends meet your needs
❌ You're just starting with docling-graph
❌ You don't need custom logic

Custom Exporters

Use when:
✅ You need a specialized output format
✅ You're integrating with a specific database
✅ You need custom data transformations
✅ Default formats don't meet requirements

Don't use when:
❌ CSV, Cypher, or JSON formats work
❌ You can post-process existing exports
❌ You're prototyping

Custom Stages

Use when:
✅ You need custom preprocessing
✅ You want to add validation steps
✅ You need custom post-processing
✅ You're building a specialized pipeline

Don't use when:
❌ Default pipeline stages suffice
❌ You can achieve goals with configuration
❌ You're learning the system


Architecture

Extension Points

--8<-- "docs/assets/flowcharts/extension_points.md"

The flowchart marks three extension points:

  • Custom Backends (blue): Replace extraction logic
  • Custom Exporters (blue): Replace export logic
  • Custom Stages (yellow): Add processing steps

Code Organization

Project Structure for Extensions

my_project/
├── templates/              # Pydantic templates
│   └── my_template.py
├── backends/               # Custom backends
│   ├── __init__.py
│   └── my_backend.py
├── exporters/              # Custom exporters
│   ├── __init__.py
│   └── my_exporter.py
├── stages/                 # Custom stages
│   ├── __init__.py
│   └── my_stage.py
├── tests/                  # Tests
│   ├── test_backend.py
│   ├── test_exporter.py
│   └── test_stage.py
└── main.py                 # Entry point

Development Workflow

1. Design

# Define interface
from docling_graph.protocols import TextExtractionBackendProtocol

class MyBackend(TextExtractionBackendProtocol):
    """Custom backend implementation."""
    pass

2. Implement

# Implement methods on the class defined in step 1
class MyBackend(TextExtractionBackendProtocol):
    def extract_from_markdown(self, markdown: str, template, context="", is_partial=False):
        """Extract structured data."""
        # Your logic here
        pass

3. Test

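# Write tests
# Import paths follow the project structure above; adjust them to your layout
from backends.my_backend import MyBackend
from templates.my_template import MyTemplate

def test_my_backend():
    backend = MyBackend()
    result = backend.extract_from_markdown("test", MyTemplate)
    assert result is not None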

4. Integrate

# Use in pipeline
from docling_graph import PipelineConfig
from backends.my_backend import MyBackend  # path matches the project structure above

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    # Custom backend integration
)

Best Practices

👍 Follow Protocols

# ✅ Good - Implement protocol
from docling_graph.protocols import TextExtractionBackendProtocol

class MyBackend(TextExtractionBackendProtocol):
    def extract_from_markdown(self, markdown, template, context="", is_partial=False): ...
    def consolidate_from_pydantic_models(self, *args, **kwargs): ...  # parameters omitted here
    def cleanup(self): ...

# ❌ Avoid - Custom interface
class MyBackend:
    def my_custom_method(self, *args, **kwargs): ...

👍 Handle Errors

# ✅ Good - Use docling-graph exceptions
from docling_graph.exceptions import ExtractionError

def extract(self, source):
    try:
        return self._process(source)
    except Exception as e:
        # Wrap the original error so callers handle a single exception type
        raise ExtractionError(
            "Extraction failed",
            details={"source": source},
            cause=e
        ) from e

# ❌ Avoid - Generic exceptions
def extract(self, source):
    raise Exception("Something went wrong")
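To make the wrapping pattern above concrete without depending on the library, the sketch below defines a small stand-in exception hierarchy. `PipelineError`, `ExtractionFailed`, and `parse_total` are hypothetical names for this example; they only imitate the `details=` and `cause=` arguments used above and are not docling-graph's exception classes.

```python
class PipelineError(Exception):
    """Hypothetical base error: a message plus structured details and the original cause."""

    def __init__(self, message, details=None, cause=None):
        super().__init__(message)
        self.message = message
        self.details = dict(details or {})
        self.cause = cause

    def __str__(self):
        parts = [self.message]
        if self.details:
            parts.append(f"details={self.details}")
        if self.cause is not None:
            parts.append(f"cause={self.cause!r}")
        return " | ".join(parts)


class ExtractionFailed(PipelineError):
    """Hypothetical subclass for extraction-time failures."""


def parse_total(raw):
    """Convert a raw string to float, wrapping low-level errors in a domain error."""
    try:
        return float(raw)
    except ValueError as exc:
        raise ExtractionFailed(
            "Could not parse total", details={"raw": raw}, cause=exc
        ) from exc
```

Callers can catch `PipelineError` to handle every pipeline failure uniformly, or a specific subclass when only one kind of failure is recoverable. Using `raise ... from exc` also keeps the original traceback attached.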

👍 Write Tests

# ✅ Good - Comprehensive tests
def test_backend_success():
    """Test successful extraction."""
    pass

def test_backend_failure():
    """Test error handling."""
    pass

def test_backend_cleanup():
    """Test resource cleanup."""
    pass

# ❌ Avoid - No tests
# (No tests written)

👍 Document Code

# ✅ Good - Clear documentation
class MyBackend:
    """
    Custom backend for specialized extraction.
    
    This backend uses a proprietary model to extract
    structured data from documents.
    
    Args:
        api_key: API key for the service
        model: Model name to use
        
    Example:
        >>> backend = MyBackend(api_key="key", model="model-v1")
        >>> result = backend.extract_from_markdown(text, Template)
    """
    pass

# ❌ Avoid - No documentation
class MyBackend:
    pass

Performance Considerations

Memory Management

# ✅ Good - Clean up resources
class MyBackend:
    def cleanup(self):
        """Release resources."""
        if hasattr(self, 'model'):
            del self.model
        if hasattr(self, 'client'):
            self.client.close()

# ❌ Avoid - Memory leaks
class MyBackend:
    def cleanup(self):
        pass  # Resources not released

Batch Processing

# ✅ Good - Process in batches
def process_documents(docs):
    batch_size = 10
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i+batch_size]
        process_batch(batch)

# ❌ Avoid - Process all at once
def process_documents(docs):
    process_all(docs)  # May run out of memory
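The slicing approach above needs `len(docs)`, so it only works on sequences. When documents come from a generator or a directory scan, a lazy variant avoids materializing the full list first. The helper below is a generic sketch (the name `batched` is chosen for this example); on Python 3.12 and later, `itertools.batched` offers similar behavior but yields tuples.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield lists of up to batch_size items without loading everything into memory."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Usage mirrors the loop above: `for batch in batched(docs, 10): process_batch(batch)`, where `process_batch` is whatever per-batch function you already have.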

Security Considerations

API Keys

# ✅ Good - Use environment variables
import os

api_key = os.getenv("MY_API_KEY")
if not api_key:
    raise ValueError("MY_API_KEY not set")

# ❌ Avoid - Hardcoded keys
api_key = "sk-1234567890"  # Never do this!

Input Validation

# ✅ Good - Validate inputs
def extract(self, markdown: str, template):
    if not markdown:
        raise ValueError("Markdown cannot be empty")
    if not template:
        raise ValueError("Template is required")
    # Process...

# ❌ Avoid - No validation
def extract(self, markdown, template):
    # Process without checks
    pass
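The truthiness checks above accept whitespace-only strings and any truthy object passed as a template. A slightly stricter variant is sketched below; `validate_inputs` is a hypothetical helper, and whether a template must be a class (rather than, say, a dotted-path string) depends on how your backend is called, so treat that check as an assumption.

```python
import inspect


def validate_inputs(markdown, template):
    """Raise early with a specific message instead of failing deep inside extraction."""
    if not isinstance(markdown, str):
        raise TypeError(f"markdown must be str, got {type(markdown).__name__}")
    if not markdown.strip():
        raise ValueError("Markdown cannot be empty or whitespace-only")
    # Assumption: templates are passed as classes, not instances or None.
    if not inspect.isclass(template):
        raise TypeError("template must be a class, not an instance or None")
```

Distinguishing `TypeError` from `ValueError` lets callers tell a programming mistake (wrong type) apart from bad data (empty document).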

Next Steps

Choose a topic based on your needs:

  1. Custom Backends - Extend extraction capabilities
  2. Custom Exporters - Create custom output formats
  3. Custom Stages - Add pipeline stages
  4. Performance Tuning - Optimize performance
  5. Error Handling - Handle errors gracefully
  6. Testing - Test your extensions