Thank you for your interest in contributing to Contextifier! This document provides guidelines and instructions for contributing.
git clone https://github.com/your-org/contextifier.git
cd contextifier
python -m venv .venv
source .venv/bin/activate # Linux/Mac
.venv\Scripts\activate # Windows
pip install -e ".[dev]"

contextifier_new/ # v2 main package
├── document_processor.py # Facade (public API)
├── config.py # ProcessingConfig
├── types.py # Shared types
├── errors.py # Exception hierarchy
├── handlers/ # 14 format handlers
├── pipeline/ # 5-Stage ABCs
├── services/ # Shared services
├── chunking/ # Chunking subsystem
└── ocr/ # OCR subsystem
- Python 3.12+ syntax
- Type hints required on all public APIs
- Docstrings required (Google style)
- `from __future__ import annotations` at the top of every module
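As an illustration of the style rules above, a public function might look like the sketch below. The function `count_words` and its behavior are hypothetical and are not part of Contextifier; only the conventions (future import, type hints, Google-style docstring) come from this guide.

```python
from __future__ import annotations


def count_words(text: str, *, min_length: int = 1) -> int:
    """Count whitespace-separated words in a string.

    Hypothetical example used only to show the expected code style.

    Args:
        text: Input text to scan.
        min_length: Minimum number of characters a word needs to be counted.

    Returns:
        The number of words whose length is at least ``min_length``.
    """
    return sum(1 for word in text.split() if len(word) >= min_length)
```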
- All handlers must follow the 5-stage pipeline: `Converter` → `Preprocessor` → `MetadataExtractor` → `ContentExtractor` → `Postprocessor`.
  - `BaseHandler.process()` enforces execution order, so implement each stage only.
- Do not create services directly:
  - `TagService`, `ImageService`, etc. are created by `DocumentProcessor` and injected.
  - Handlers access them via `self._services["tag_service"]`, etc.
- Pass all settings through `ProcessingConfig`:
  - No hardcoded magic numbers.
  - If you need a new setting, add a field to the appropriate `*Config` class.
- Respect the Facade pattern:
  - The only user-facing API is `DocumentProcessor`.
  - Do not instruct users to import internal modules directly (OCR engines excepted).

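The rules above can be pictured with a small, self-contained sketch. It does not use the real Contextifier classes: `SketchConfig`, `SketchHandler`, `TextHandler`, the stage method names, and the callable stored under `"tag_service"` are all hypothetical stand-ins, chosen only to show a base class that fixes the stage order, services that are injected rather than constructed, and a setting that comes from a config object instead of a magic number.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for ProcessingConfig."""

    max_chars: int = 20  # a setting lives here instead of a hardcoded number


class SketchHandler(ABC):
    """Hypothetical stand-in for BaseHandler: it owns the stage order."""

    def __init__(self, config: SketchConfig, services: dict[str, Any]) -> None:
        self._config = config
        self._services = services  # injected by the caller, never built here

    def process(self, raw: bytes) -> dict[str, Any]:
        """Run the five stages in a fixed order."""
        converted = self.convert(raw)
        cleaned = self.preprocess(converted)
        metadata = self.extract_metadata(cleaned)
        content = self.extract_content(cleaned)
        return self.postprocess(metadata, content)

    @abstractmethod
    def convert(self, raw: bytes) -> Any: ...

    @abstractmethod
    def preprocess(self, converted: Any) -> Any: ...

    @abstractmethod
    def extract_metadata(self, cleaned: Any) -> dict[str, Any]: ...

    @abstractmethod
    def extract_content(self, cleaned: Any) -> str: ...

    @abstractmethod
    def postprocess(self, metadata: dict[str, Any], content: str) -> dict[str, Any]: ...


class TextHandler(SketchHandler):
    """Implements each stage only; the order comes from process()."""

    def convert(self, raw: bytes) -> str:
        return raw.decode("utf-8")

    def preprocess(self, converted: str) -> str:
        return converted.strip()

    def extract_metadata(self, cleaned: str) -> dict[str, Any]:
        return {"length": len(cleaned)}

    def extract_content(self, cleaned: str) -> str:
        tag = self._services["tag_service"]  # injected service lookup
        return tag(cleaned[: self._config.max_chars])

    def postprocess(self, metadata: dict[str, Any], content: str) -> dict[str, Any]:
        return {"metadata": metadata, "content": content}


# The caller builds the services once and injects them into the handler.
handler = TextHandler(
    SketchConfig(max_chars=5),
    {"tag_service": lambda s: f"[text]{s}[/text]"},
)
result = handler.process(b"  hello world  ")
print(result)
```

Because `process()` lives only in the base class, a new handler in this sketch cannot reorder or skip stages; it can only fill them in.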
contextifier_new/handlers/myformat/
├── __init__.py
├── converter.py
├── preprocessor.py
├── metadata_extractor.py
├── content_extractor.py
└── postprocessor.py
# converter.py
from contextifier_new.pipeline.converter import BaseConverter

class MyFormatConverter(BaseConverter):
    def convert(self, file_context, **kwargs):
        # Binary → format-specific object
        parsed_object = ...  # placeholder: parse file_context here
        return parsed_object

Add to `contextifier_new/handlers/registry.py` in `register_defaults()`:

from contextifier_new.handlers.myformat import MyFormatHandler
self.register(MyFormatHandler, extensions=["myf", "myformat"])

feat: add new feature
fix: bug fix
docs: documentation changes
refactor: refactoring (no behavior change)
test: add/modify tests
chore: build/config changes
Examples:
feat(handler): add EPUB handler with full pipeline
fix(chunking): preserve table structure in protected strategy
docs: update QUICKSTART with batch processing example
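
To check a commit subject against the types listed above before pushing, a small validator along these lines could be used. This is a sketch, not a project tool: the function name `is_valid_commit_subject` and the exact pattern (an optional scope in parentheses, then a colon and a space) are assumptions inferred from the examples.

```python
from __future__ import annotations

import re

# Commit types taken from the list in this guide.
COMMIT_TYPES = ("feat", "fix", "docs", "refactor", "test", "chore")

# type, optional "(scope)", then ": " and a non-empty summary.
_PATTERN = re.compile(
    rf"^(?P<type>{'|'.join(COMMIT_TYPES)})"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r": (?P<summary>\S.*)$"
)


def is_valid_commit_subject(subject: str) -> bool:
    """Return True if the subject line matches the assumed convention."""
    return _PATTERN.match(subject) is not None


print(is_valid_commit_subject("feat(handler): add EPUB handler with full pipeline"))
print(is_valid_commit_subject("update stuff"))
```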
- Create a feature branch from `main`
- Implement changes and test
- Include rationale and test results in PR description
- Squash merge after review
When reporting a bug, please include:
- Python version
- OS and version
- Input file format and size
- Full error message
- Reproduction code (if possible)
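
The first two items can be collected with the standard library. The helper below is only a convenience sketch; the function name `environment_summary` is hypothetical and is not shipped with Contextifier.

```python
from __future__ import annotations

import platform


def environment_summary() -> str:
    """Return the Python and OS details requested in a bug report."""
    return "\n".join(
        [
            f"Python version: {platform.python_version()}",
            f"OS and version: {platform.system()} {platform.release()}",
        ]
    )


print(environment_summary())
```

Paste the printed lines into the report alongside the input file details and the full error message.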
All contributions are released under the project's Apache License 2.0.