refactor(extract): Modularize extract package for testability #12

@prosdev

Context

packages/extract/src/index.ts is a 386-line monolith containing:

  • Zod schemas
  • MIME type detection
  • PDF to image conversion
  • OCR processing
  • Gemini provider
  • Ollama provider
  • Streaming logic

This makes unit testing difficult: OCR, PDF conversion, and the providers cannot be tested in isolation.

Architecture Decisions

  • Package: packages/extract
  • Pattern: Module decomposition with barrel export
  • Goal: Each module testable independently

Proposed Structure

packages/extract/src/
├── index.ts              # Barrel export only
├── schemas.ts            # Zod schemas
├── mime.ts               # getMimeType()
├── pdf.ts                # pdfToImages()
├── ocr.ts                # ocrImages() + types
├── extract.ts            # extractDocument() orchestrator
├── providers/
│   ├── gemini.ts         # extractWithGemini()
│   └── ollama.ts         # extractWithOllama()
└── types.ts              # StreamChunk, StreamCallback, ExtractOptions
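
To illustrate how small a standalone module could be, here is a minimal sketch of what mime.ts might contain. The extension table, the fallback value, and the choice to detect by file extension are all assumptions for illustration; the current getMimeType() in index.ts may take different inputs or use different rules. In the real module the function would be exported.

```typescript
// Hypothetical sketch of mime.ts. The extension table and the fallback
// type are assumptions, not the existing implementation.
const MIME_BY_EXTENSION: Record<string, string> = {
  pdf: "application/pdf",
  png: "image/png",
  jpg: "image/jpeg",
  jpeg: "image/jpeg",
  webp: "image/webp",
};

const FALLBACK_MIME = "application/octet-stream";

// Pure function with no I/O, so it can be unit tested without fixtures.
function getMimeType(filename: string): string {
  const dot = filename.lastIndexOf(".");
  // No extension, or a trailing dot: fall back to a generic binary type.
  if (dot === -1 || dot === filename.length - 1) return FALLBACK_MIME;
  const ext = filename.slice(dot + 1).toLowerCase();
  return MIME_BY_EXTENSION[ext] ?? FALLBACK_MIME;
}
```

Because the function has no dependency on the OCR, PDF, or provider code, a test file can import it alone and assert on plain strings.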

Requirements

  • Extract Zod schemas to schemas.ts
  • Extract MIME detection to mime.ts
  • Extract PDF conversion to pdf.ts
  • Extract OCR logic to ocr.ts
  • Extract Gemini provider to providers/gemini.ts
  • Extract Ollama provider to providers/ollama.ts
  • Create shared types in types.ts
  • Create barrel export in index.ts
  • Update tests to use new module structure
  • Maintain 100% backward compatibility

Success Criteria

  • All existing tests pass
  • Each module can be imported/tested independently
  • No breaking changes to public API
  • Coverage maintained or improved
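
One possible way to meet the "tested independently" criterion is to have the orchestrator receive its providers as an argument, so a test can substitute a fake for extractWithGemini() or extractWithOllama(). The sketch below is an assumption throughout: the shapes of StreamChunk, StreamCallback, and ExtractOptions are invented here (the issue only names them), and the `Provider` type and the extra `providers` parameter are hypothetical and not part of the current public API.

```typescript
// Assumed shapes: the issue names these types but does not define them.
type StreamChunk = { type: "text"; content: string };
type StreamCallback = (chunk: StreamChunk) => void;
type ExtractOptions = { provider: "gemini" | "ollama"; onChunk?: StreamCallback };

// Hypothetical provider signature: raw bytes plus a MIME type in, text out.
type Provider = (
  data: Uint8Array,
  mimeType: string,
  onChunk?: StreamCallback,
) => Promise<string>;

// Hypothetical orchestrator: providers are passed in rather than imported,
// so tests can run without network access or a local model.
async function extractDocument(
  data: Uint8Array,
  mimeType: string,
  options: ExtractOptions,
  providers: Record<ExtractOptions["provider"], Provider>,
): Promise<string> {
  const provider = providers[options.provider];
  return provider(data, mimeType, options.onChunk);
}

// Fake provider used only in tests: emits one chunk and echoes the MIME type.
const fakeProvider: Provider = async (_data, mimeType, onChunk) => {
  onChunk?.({ type: "text", content: "hello" });
  return `extracted:${mimeType}`;
};
```

To keep the public API unchanged, the barrel export could wrap this function and supply the real providers by default, leaving the injectable form for tests.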
