
feat(rag): implement document structure extraction layer#199

Open
zaira-bibi wants to merge 2 commits into main from zaira/rag-chunking



@zaira-bibi zaira-bibi commented Mar 12, 2026

Supports PDF, DOCX, HTML and TXT/MD conversion to structured Markdown. Preserves heading hierarchy, lists, and tables across all formats.

Closes #197

@zaira-bibi zaira-bibi self-assigned this Mar 12, 2026

claude bot commented Mar 12, 2026

Claude encountered an error —— View job


I'll analyze this and get back to you.


"pymupdf4llm>=1.27.2.1",
"docx>=0.2.4",
"bs4>=0.0.2",
"python-docx>=1.2.0",

Conflicting dependency: docx>=0.2.4 (legacy 2014 stub) and python-docx>=1.2.0 both claim the docx namespace. The code imports from python-docx (from docx import Document). The legacy docx package can overwrite python-docx's module depending on install order, breaking imports at runtime. Remove the "docx>=0.2.4" line.

Similarly, "bs4>=0.0.2" is a PyPI shim that redirects to beautifulsoup4. Replace with "beautifulsoup4>=4.0" to depend on the actual library directly.

    lines.append(f"{_HTML_HEADING_MAP[name]} {text}")
    lines.append("")

elif name in ("ul", "ol"):

Ordered lists silently lose numbering: Both <ul> and <ol> are rendered as - item. For <ol>, this should produce 1. item, 2. item, etc. to preserve ordering semantics. The PR description claims to "preserve lists" but ordered list structure is discarded here.
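
A minimal sketch of the numbering fix, assuming the item texts have already been collected (for example with `li.get_text(" ", strip=True)` as in the PR). `render_list_items` is a hypothetical helper; it ignores the `start` attribute and nested lists.

```python
def render_list_items(list_tag: str, items: list[str]) -> list[str]:
    """Render list item texts as Markdown, numbering them for <ol>."""
    lines: list[str] = []
    number = 1
    for text in items:
        if not text:
            continue  # skip empty items, as the existing loop does
        if list_tag == "ol":
            lines.append(f"{number}. {text}")
            number += 1
        else:
            lines.append(f"- {text}")
    return lines
```

Counting only non-empty items keeps the numbering contiguous when an `<li>` has no text.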

# List items — python-docx exposes numPr when a paragraph is in a list
is_list = para._element.find(qn("w:numPr")) is not None
if is_list:
    return f"- {text}"

Same issue for DOCX lists: w:numPr is present for both ordered and unordered lists. What distinguishes bullet from decimal/roman-numeral lists is the w:numFmt value of the numbering level that w:numPr references (via w:numId and w:ilvl) in the document's numbering definitions, and it is not consulted here. All list paragraphs are rendered as - item, silently discarding numbering for ordered lists.
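
One way to tell the two list kinds apart is to resolve the paragraph's `w:numId` and `w:ilvl` against the numbering definitions (`numbering.xml`). The sketch below uses only the standard library and an inline sample document. `num_fmt`, `list_marker`, and `SAMPLE_NUMBERING` are hypothetical names, and the sketch ignores `w:lvlOverride` and style-linked numbering.

```python
import xml.etree.ElementTree as ET
from typing import Optional

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Tiny stand-in for a real numbering.xml part (illustration only).
SAMPLE_NUMBERING = f"""
<w:numbering xmlns:w="{W}">
  <w:abstractNum w:abstractNumId="0">
    <w:lvl w:ilvl="0"><w:numFmt w:val="bullet"/></w:lvl>
  </w:abstractNum>
  <w:abstractNum w:abstractNumId="1">
    <w:lvl w:ilvl="0"><w:numFmt w:val="decimal"/></w:lvl>
  </w:abstractNum>
  <w:num w:numId="1"><w:abstractNumId w:val="0"/></w:num>
  <w:num w:numId="2"><w:abstractNumId w:val="1"/></w:num>
</w:numbering>
"""


def num_fmt(numbering_xml: str, num_id: str, ilvl: str = "0") -> Optional[str]:
    """Resolve numId -> abstractNum -> level -> numFmt value."""
    root = ET.fromstring(numbering_xml)
    for num in root.findall("w:num", NS):
        if num.get(f"{{{W}}}numId") != num_id:
            continue
        ref = num.find("w:abstractNumId", NS)
        if ref is None:
            return None
        abstract_id = ref.get(f"{{{W}}}val")
        for abstract in root.findall("w:abstractNum", NS):
            if abstract.get(f"{{{W}}}abstractNumId") != abstract_id:
                continue
            for lvl in abstract.findall("w:lvl", NS):
                if lvl.get(f"{{{W}}}ilvl") == ilvl:
                    fmt = lvl.find("w:numFmt", NS)
                    return fmt.get(f"{{{W}}}val") if fmt is not None else None
    return None


def list_marker(fmt: Optional[str], position: int) -> str:
    """Bullets (or unknown formats) keep '-', anything else is numbered."""
    return "-" if fmt in (None, "bullet") else f"{position}."
```

Falling back to `-` when the format cannot be resolved keeps the current behaviour for documents whose numbering definitions are missing or unusual.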

# Use <body> if present, otherwise the whole document
root = soup.body or soup

for el in root.children:

Only direct children of <body> are processed: This loop iterates root.children (immediate children only). Any content wrapped in <div>, <section>, <article>, or similar container elements -- which is extremely common in real-world HTML (especially Google Docs exports) -- will be silently dropped. This needs recursive traversal or flattening of container elements to avoid losing large portions of HTML documents.
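
The flattening approach could look roughly like the sketch below. It is duck-typed: it only relies on nodes having `.name` and `.children`, which BeautifulSoup `Tag` objects provide. `CONTAINER_TAGS` and `iter_block_elements` are hypothetical names, and the tag set is an assumption that would need tuning.

```python
from typing import Any, Iterator

# Wrapper elements to descend into rather than render (assumed set).
CONTAINER_TAGS = {"div", "section", "article", "main", "header", "footer", "aside"}


def iter_block_elements(node: Any) -> Iterator[Any]:
    """Yield non-container elements in document order, descending into wrappers."""
    for child in getattr(node, "children", ()):
        name = getattr(child, "name", None)
        if not name:
            continue  # text nodes and comments have no tag name
        if name in CONTAINER_TAGS:
            yield from iter_block_elements(child)
        else:
            yield child
```

The existing loop could then iterate `iter_block_elements(root)` instead of `root.children`. Text placed directly inside a container, with no wrapping `<p>`, is still skipped by this sketch and would need separate handling.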

_DISPATCH = {
    "pdf": _extract_pdf,
    "docx": _extract_docx,
    "doc": _extract_docx,

.doc is not supported by python-docx: This maps legacy .doc (pre-Word 2007 binary format) to the DOCX extractor, but python-docx only handles .docx (OOXML/ZIP). A real .doc file will raise BadZipFile at runtime with no clear error message. Either remove "doc" from the dispatch so the ValueError for unsupported types is raised, or add explicit error handling with a clear message.
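
The second option, explicit handling with a clear message, might be sketched as follows. `resolve_extractor` and the `Extractor` signature are assumptions for illustration; the PR's actual extractor signatures are not shown in this thread.

```python
from typing import Callable, Mapping

Extractor = Callable[[bytes], str]  # assumed signature, for illustration only


def resolve_extractor(file_type: str, dispatch: Mapping[str, Extractor]) -> Extractor:
    """Look up an extractor, rejecting legacy .doc with an actionable message."""
    ext = file_type.lower().lstrip(".")
    if ext == "doc":
        raise ValueError(
            "Legacy .doc (binary Word) files are not supported; "
            "convert the file to .docx first."
        )
    try:
        return dispatch[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext!r}") from None
```

With this in place, the `"doc"` entry can be dropped from `_DISPATCH`, and callers get a `ValueError` that says what to do instead of a low-level ZIP error.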


hamza-56 commented Mar 17, 2026

  1. Conflicting docx and python-docx dependencies -- both claim the docx namespace; bs4 is a shim for beautifulsoup4

sparkth/pyproject.toml

Lines 44 to 48 in ecb057d

    "pymupdf4llm>=1.27.2.1",
    "docx>=0.2.4",
    "bs4>=0.0.2",
    "python-docx>=1.2.0",
]

  2. Ordered lists lose numbering in HTML extraction -- <ol> items rendered as - item instead of 1. item

elif name in ("ul", "ol"):
    for li in el.find_all("li", recursive=False):
        text = li.get_text(" ", strip=True)
        if text:
            lines.append(f"- {text}")
    lines.append("")

  3. Ordered lists lose numbering in DOCX extraction -- w:numPr is present for both ordered and unordered lists but w:numFmt is not consulted

# List items — python-docx exposes numPr when a paragraph is in a list
is_list = para._element.find(qn("w:numPr")) is not None
if is_list:
    return f"- {text}"

  4. HTML extraction only processes direct children of <body> -- content inside <div>, <section>, <article> containers (common in real-world HTML) is silently dropped

root = soup.body or soup
for el in root.children:
    if not isinstance(el, Tag) or not el.name:
        continue

  5. .doc mapped to DOCX extractor but unsupported -- python-docx only handles .docx (OOXML); legacy .doc binary files will raise BadZipFile at runtime

_DISPATCH = {
    "pdf": _extract_pdf,
    "docx": _extract_docx,
    "doc": _extract_docx,
    "html": _extract_html,



Development

Successfully merging this pull request may close these issues.

RAG - Structure Extraction

2 participants