Advanced Topics

Overview

This section covers advanced topics for extending and optimizing docling-graph. These guides are for users who need to:

Create custom extraction backends
Build custom exporters
Add pipeline stages
Optimize performance
Handle errors gracefully
Test templates and pipelines

Topics

🧩 Extensibility

Custom Backends
Create custom extraction backends for specialized models or APIs.

Implement backend protocols
VLM backend example
LLM backend example
Integration with pipeline

Custom Exporters
Build custom exporters for specialized output formats.

Implement exporter protocol
Graph data access
Custom format generation
Registration and usage

Custom Stages
Add custom stages to the pipeline for specialized processing.

Pipeline stage protocol
Stage implementation
Context management
Error handling

📐 Optimization

Performance Tuning
Optimize extraction speed and resource usage.

Model selection strategies
Batch size optimization
Memory management
GPU utilization
Caching strategies

🛡️ Reliability

Error Handling
Handle errors gracefully and implement retry logic.

Exception hierarchy
Error recovery strategies
Logging and debugging
Retry mechanisms

Testing
Test templates, backends, and pipelines.

Template validation
Mock backends
Integration testing
CI/CD integration

Prerequisites

Before diving into advanced topics, ensure you understand:

Schema Definition - Pydantic templates
Pipeline Configuration - Configuration options
Extraction Process - How extraction works
Python API - Programmatic usage

When to Use Advanced Features

Custom Backends

Use when:
✅ You have a specialized model not supported by default
✅ You need to integrate with a proprietary API
✅ You want to implement custom preprocessing
✅ You need fine-grained control over extraction

Don't use when:
❌ Default backends meet your needs
❌ You're just starting with docling-graph
❌ You don't need custom logic

Custom Exporters

Use when:
✅ You need a specialized output format
✅ You're integrating with a specific database
✅ You need custom data transformations
✅ Default formats don't meet requirements

Don't use when:
❌ CSV, Cypher, or JSON formats work
❌ You can post-process existing exports
❌ You're prototyping

Custom Stages

Use when:
✅ You need custom preprocessing
✅ You want to add validation steps
✅ You need custom post-processing
✅ You're building a specialized pipeline

Don't use when:
❌ Default pipeline stages suffice
❌ You can achieve goals with configuration
❌ You're learning the system

Architecture

Extension Points

--8<-- "docs/assets/flowcharts/extension_points.md"

Extension Points:

Custom Backends (blue): Replace extraction logic
Custom Exporters (blue): Replace export logic
Custom Stages (yellow): Add processing steps
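
To make the idea of "adding processing steps" concrete, here is a minimal sketch of a stage chain. It is not docling-graph's stage protocol: the `Context` and `Stage` types, the `run_stages` helper, and both example stages are hypothetical names chosen for illustration. See the Custom Stages guide for the real interface.

```python
from typing import Any, Callable, Dict, List

# Hypothetical types: a stage takes the shared context and returns it.
Context = Dict[str, Any]
Stage = Callable[[Context], Context]


def strip_whitespace(context: Context) -> Context:
    """Pre-processing stage: trim leading and trailing whitespace."""
    context["markdown"] = context["markdown"].strip()
    return context


def require_text(context: Context) -> Context:
    """Validation stage: stop the run early if there is nothing to extract."""
    if not context["markdown"]:
        raise ValueError("Document is empty after preprocessing")
    return context


def run_stages(stages: List[Stage], context: Context) -> Context:
    """Run each stage in order, passing the context along."""
    for stage in stages:
        context = stage(context)
    return context
```

Each stage only reads and writes the shared context, so a new step can be added or reordered without touching the others.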

Code Organization

Project Structure for Extensions

my_project/
├── templates/              # Pydantic templates
│   └── my_template.py
├── backends/               # Custom backends
│   ├── __init__.py
│   └── my_backend.py
├── exporters/              # Custom exporters
│   ├── __init__.py
│   └── my_exporter.py
├── stages/                 # Custom stages
│   ├── __init__.py
│   └── my_stage.py
├── tests/                  # Tests
│   ├── test_backend.py
│   ├── test_exporter.py
│   └── test_stage.py
└── main.py                 # Entry point

Development Workflow

1. Design

# Define interface
from docling_graph.protocols import TextExtractionBackendProtocol

class MyBackend(TextExtractionBackendProtocol):
    """Custom backend implementation."""
    pass

2. Implement

# Implement methods
def extract_from_markdown(self, markdown: str, template, context="", is_partial=False):
    """Extract structured data."""
    # Your logic here
    pass
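
As a rough illustration of what "your logic here" might look like, the sketch below fills in `extract_from_markdown` for a toy backend that reads `field: value` lines. It runs on the standard library only, so the `Invoice` dataclass stands in for a Pydantic template. `KeyValueBackend`, `Invoice`, and the choice to return `None` for an incomplete result are assumptions, not docling-graph behavior.

```python
import re
from dataclasses import dataclass, fields


@dataclass
class Invoice:
    """Hypothetical stand-in for a Pydantic template (stdlib only)."""
    number: str = ""
    total: str = ""


class KeyValueBackend:
    """Toy backend: reads 'field: value' lines from markdown."""

    def extract_from_markdown(self, markdown, template, context="", is_partial=False):
        # `context` is accepted for signature compatibility but unused here.
        wanted = {f.name for f in fields(template)}
        found = {}
        for line in markdown.splitlines():
            match = re.match(r"\s*(\w+)\s*:\s*(.+)", line)
            if match and match.group(1).lower() in wanted:
                found[match.group(1).lower()] = match.group(2).strip()
        # Assumption: a partial extraction may miss fields; a full one may not.
        if not is_partial and wanted - set(found):
            return None
        return template(**found)
```

A real backend would call a model or an API at the point where this sketch parses lines; the surrounding shape (inputs in, template instance out) stays the same.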

3. Test

# Write tests
def test_my_backend():
    backend = MyBackend()
    result = backend.extract_from_markdown("test", MyTemplate)
    assert result is not None
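
Code that calls a backend can also be tested without loading any model. The sketch below uses `unittest.mock.MagicMock` as a mock backend. `run_extraction` is a hypothetical caller written for this example; only the method names `extract_from_markdown` and `cleanup` come from this page.

```python
from unittest.mock import MagicMock


def run_extraction(backend, markdown, template):
    """Hypothetical caller: extracts, then always releases the backend."""
    try:
        return backend.extract_from_markdown(markdown, template)
    finally:
        backend.cleanup()


def test_run_extraction_with_mock_backend():
    # The mock records every call and returns a canned result.
    backend = MagicMock()
    backend.extract_from_markdown.return_value = {"title": "Report"}

    result = run_extraction(backend, "# Report", dict)

    assert result == {"title": "Report"}
    backend.extract_from_markdown.assert_called_once_with("# Report", dict)
    backend.cleanup.assert_called_once_with()
```

Setting `backend.extract_from_markdown.side_effect` to an exception lets the same pattern cover the failure path.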

4. Integrate

# Use in pipeline
from docling_graph import PipelineConfig
from backends.my_backend import MyBackend  # Path matches the project structure above

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    # Custom backend integration
)

Best Practices

👍 Follow Protocols

# ✅ Good - Implement protocol
from docling_graph.protocols import TextExtractionBackendProtocol

class MyBackend(TextExtractionBackendProtocol):
    def extract_from_markdown(self, ...): ...
    def consolidate_from_pydantic_models(self, ...): ...
    def cleanup(self): ...

# ❌ Avoid - Custom interface
class MyBackend:
    def my_custom_method(self, ...): ...
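
One practical benefit of a shared protocol is that conformance can be checked before a backend is used. The sketch below shows the idea with `typing.Protocol`. `BackendLike` is a hypothetical protocol that mirrors the three method names used above; the parameter lists (especially `models`) are assumptions, and the real definition lives in `docling_graph.protocols`.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class BackendLike(Protocol):
    """Hypothetical protocol mirroring the three method names used above."""

    def extract_from_markdown(self, markdown: str, template: Any,
                              context: str = "", is_partial: bool = False) -> Any: ...
    def consolidate_from_pydantic_models(self, models: list) -> Any: ...
    def cleanup(self) -> None: ...


class GoodBackend:
    def extract_from_markdown(self, markdown, template, context="", is_partial=False):
        return None

    def consolidate_from_pydantic_models(self, models):
        return None

    def cleanup(self):
        pass


class CustomInterfaceBackend:
    def my_custom_method(self):
        pass


def conforms(backend) -> bool:
    # isinstance() on a runtime_checkable Protocol only checks that the
    # methods exist; it does not compare signatures.
    return isinstance(backend, BackendLike)
```

Here `conforms(GoodBackend())` is true and `conforms(CustomInterfaceBackend())` is false, which is why a custom interface cannot be dropped into code that expects the protocol.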

👍 Handle Errors

# ✅ Good - Use docling-graph exceptions
from docling_graph.exceptions import ExtractionError

def extract(self, source):
    try:
        return self._process(source)
    except Exception as e:
        # Chain the original exception so its traceback is preserved
        raise ExtractionError(
            "Extraction failed",
            details={"source": source},
            cause=e,
        ) from e

# ❌ Avoid - Generic exceptions
def extract(self, source):
    raise Exception("Something went wrong")
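
Raising a specific exception type also makes retry logic straightforward, because the caller can retry on that type alone. The following is a minimal sketch of retrying with exponential backoff. `with_retries` and its parameters are hypothetical, and `ExtractionError` is redefined locally only so the sketch runs without docling-graph installed.

```python
import time


class ExtractionError(Exception):
    """Local stand-in so the sketch runs without docling-graph installed."""


def with_retries(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on ExtractionError, wait and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ExtractionError:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error unchanged.
            # Double the wait after each failure: 0.5s, 1s, 2s, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Other exception types pass straight through, so programming errors are not hidden behind retries. The `sleep` parameter exists so tests can replace the real delay.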

👍 Write Tests

# ✅ Good - Comprehensive tests
def test_backend_success():
    """Test successful extraction."""
    pass

def test_backend_failure():
    """Test error handling."""
    pass

def test_backend_cleanup():
    """Test resource cleanup."""
    pass

# ❌ Avoid - No tests
# (No tests written)

👍 Document Code

# ✅ Good - Clear documentation
class MyBackend:
    """
    Custom backend for specialized extraction.
    
    This backend uses a proprietary model to extract
    structured data from documents.
    
    Args:
        api_key: API key for the service
        model: Model name to use
        
    Example:
        >>> backend = MyBackend(api_key="key", model="model-v1")
        >>> result = backend.extract_from_markdown(text, Template)
    """
    pass

# ❌ Avoid - No documentation
class MyBackend:
    pass

Performance Considerations

Memory Management

# ✅ Good - Clean up resources
class MyBackend:
    def cleanup(self):
        """Release resources."""
        if hasattr(self, 'model'):
            del self.model
        if hasattr(self, 'client'):
            self.client.close()

# ❌ Avoid - Memory leaks
class MyBackend:
    def cleanup(self):
        pass  # Resources not released
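
A `cleanup()` method only helps if it is actually called, including when extraction fails. One way to guarantee that is a small context manager, sketched below. `managed` and `DemoBackend` are hypothetical names for this example.

```python
from contextlib import contextmanager


class DemoBackend:
    """Hypothetical backend that only records whether cleanup ran."""

    def __init__(self):
        self.released = False

    def cleanup(self):
        self.released = True


@contextmanager
def managed(backend):
    """Yield the backend and call cleanup() on exit, even after an error."""
    try:
        yield backend
    finally:
        backend.cleanup()
```

Usage is `with managed(DemoBackend()) as backend: ...`; the `finally` clause runs whether the body returns normally or raises.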

Batch Processing

# ✅ Good - Process in batches
def process_documents(docs):
    batch_size = 10
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i+batch_size]
        process_batch(batch)

# ❌ Avoid - Process all at once
def process_documents(docs):
    process_all(docs)  # May run out of memory
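
The slicing approach above needs the full list in memory. When documents come from a generator or a directory scan, a lazy variant can help. This is a sketch; `iter_batches` is a hypothetical helper name.

```python
from itertools import islice


def iter_batches(items, batch_size=10):
    """Yield lists of up to batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls at most batch_size items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

Only one batch is materialized at a time, so peak memory depends on `batch_size` rather than on the total number of documents.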

Security Considerations

API Keys

# ✅ Good - Use environment variables
import os

api_key = os.getenv("MY_API_KEY")
if not api_key:
    raise ValueError("MY_API_KEY not set")

# ❌ Avoid - Hardcoded keys
api_key = "sk-1234567890"  # Never do this!

Input Validation

# ✅ Good - Validate inputs
def extract(self, markdown: str, template):
    if not markdown:
        raise ValueError("Markdown cannot be empty")
    if not template:
        raise ValueError("Template is required")
    # Process...

# ❌ Avoid - No validation
def extract(self, markdown, template):
    # Process without checks
    pass

Next Steps

Choose a topic based on your needs:

Custom Backends - Extend extraction capabilities
Custom Exporters - Create custom output formats
Custom Stages - Add pipeline stages
Performance Tuning - Optimize performance
Error Handling - Handle errors gracefully
Testing - Test your extensions
