RAG - Structure Extraction #197

@zaira-bibi

Description

Document structure extraction (PDF/DOCX/HTML → Markdown)

Goal
Convert PDF, DOCX, and HTML documents into structured Markdown with the heading hierarchy preserved.

Responsibilities

  • Convert source formats → Markdown
  • Preserve hierarchy (#, ##, ###, etc.)
  • Normalize headings across formats
  • Produce a single structured Markdown document
  • Emit warnings for elements that may not render cleanly (e.g. tables)
  • Add unit tests covering conversion, heading normalization, and warnings
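
To make the responsibilities above concrete, here is a minimal sketch for the HTML case only, using just the Python standard library. Everything named here (`ExtractionResult`, `html_to_markdown`, `normalize_headings`, the warning text) is hypothetical and not part of any existing code in this repository. PDF and DOCX would need their own front ends, and a real implementation would probably rely on a dedicated converter library rather than a hand-written parser. The normalization rule shown (re-level headings so the outline starts at `#` and never skips a level) is one possible interpretation of "normalize headings", stated here as an assumption.

```python
import re
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional, Tuple

HEADING_TAG = re.compile(r"h[1-6]")


@dataclass
class ExtractionResult:
    """Hypothetical return type: the Markdown text plus any warnings."""
    markdown: str
    warnings: List[str] = field(default_factory=list)


class _HtmlToMarkdown(HTMLParser):
    """Collects headings and paragraphs; flags tables instead of converting them."""

    def __init__(self) -> None:
        super().__init__()
        # Each block is (level, text); level 0 means a plain paragraph.
        self.blocks: List[Tuple[int, str]] = []
        self.warnings: List[str] = []
        self._level: Optional[int] = None  # level of the block being read
        self._text: List[str] = []
        self._table_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._table_depth == 0:
                self.warnings.append("table skipped: may not render cleanly")
            self._table_depth += 1
        elif self._table_depth == 0:
            if HEADING_TAG.fullmatch(tag):
                self._level, self._text = int(tag[1]), []
            elif tag == "p":
                self._level, self._text = 0, []

    def handle_endtag(self, tag):
        if tag == "table":
            self._table_depth = max(0, self._table_depth - 1)
            return
        closes_block = tag == "p" or HEADING_TAG.fullmatch(tag)
        if self._table_depth == 0 and self._level is not None and closes_block:
            text = " ".join("".join(self._text).split())  # collapse whitespace
            if text:
                self.blocks.append((self._level, text))
            self._level = None

    def handle_data(self, data):
        if self._table_depth == 0 and self._level is not None:
            self._text.append(data)


def normalize_headings(blocks: List[Tuple[int, str]]) -> List[str]:
    """Re-level headings so the outline starts at '#' and never skips a level."""
    lines: List[str] = []
    open_levels: List[int] = []  # source levels of the currently open headings
    for level, text in blocks:
        if level == 0:
            lines.append(text)
            continue
        while open_levels and open_levels[-1] >= level:
            open_levels.pop()
        open_levels.append(level)
        lines.append("#" * len(open_levels) + " " + text)
    return lines


def html_to_markdown(html: str) -> ExtractionResult:
    parser = _HtmlToMarkdown()
    parser.feed(html)
    parser.close()
    lines = normalize_headings(parser.blocks)
    return ExtractionResult("\n\n".join(lines) + "\n", parser.warnings)
```

Returning warnings alongside the Markdown, rather than logging them, keeps the function easy to unit test: a test can assert on both the heading levels and the presence of a table warning for a given input.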
