feat(rag): implement document structure extraction layer by zaira-bibi · Pull Request #199 · edly-io/sparkth

zaira-bibi · 2026-03-12T09:25:56Z

Supports PDF, DOCX, HTML and TXT/MD conversion to structured Markdown. Preserves heading hierarchy, lists, and tables across all formats.

Closes #197

claude · 2026-03-12T09:26:18Z

Claude encountered an error —— View job

I'll analyze this and get back to you.

claude · 2026-03-12T09:35:29Z

Claude encountered an error —— View job

I'll analyze this and get back to you.

claude · 2026-03-16T02:43:50Z

Claude encountered an error —— View job

I'll analyze this and get back to you.

hamza-56 · 2026-03-17T04:23:14Z

pyproject.toml

+    "pymupdf4llm>=1.27.2.1",
+    "docx>=0.2.4",
+    "bs4>=0.0.2",
+    "python-docx>=1.2.0",


Conflicting dependency: docx>=0.2.4 (legacy 2014 stub) and python-docx>=1.2.0 both claim the docx namespace. The code imports from python-docx (from docx import Document). The legacy docx package can overwrite python-docx's module depending on install order, breaking imports at runtime. Remove the "docx>=0.2.4" line.

Similarly, "bs4>=0.0.2" is a PyPI shim that redirects to beautifulsoup4. Replace with "beautifulsoup4>=4.0" to depend on the actual library directly.

hamza-56 · 2026-03-17T04:23:16Z

app/rag/extraction.py

+                lines.append(f"{_HTML_HEADING_MAP[name]} {text}")
+                lines.append("")
+
+        elif name in ("ul", "ol"):


Ordered lists silently lose numbering: Both <ul> and <ol> are rendered as - item. For <ol>, this should produce 1. item, 2. item, etc. to preserve ordering semantics. The PR description claims to "preserve lists" but ordered list structure is discarded here.
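A possible shape for the fix, as a minimal sketch: keep the marker choice in a small pure function so the `ul`/`ol` branch only has to pass the item texts and a flag. `render_list_items` is a hypothetical name, and the sketch ignores the `start` and `reversed` attributes of `<ol>`.

```python
from typing import List


def render_list_items(items: List[str], ordered: bool) -> List[str]:
    """Render list item texts as Markdown lines, numbering ordered lists."""
    lines = []
    number = 1
    for text in items:
        text = text.strip()
        if not text:
            continue  # skip empty items, as the existing branch does
        marker = f"{number}." if ordered else "-"
        lines.append(f"{marker} {text}")
        number += 1
    return lines
```

In the existing branch this could be called with the texts from `el.find_all("li", recursive=False)` and `ordered=(name == "ol")`.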

hamza-56 · 2026-03-17T04:23:17Z

app/rag/extraction.py

+    # List items — python-docx exposes numPr when a paragraph is in a list
+    is_list = para._element.find(qn("w:numPr")) is not None
+    if is_list:
+        return f"- {text}"


Same issue for DOCX lists: w:numPr is present for both ordered and unordered lists. The w:numFmt value of the numbering level that w:numPr references (defined in the document's numbering part) distinguishes bullet from decimal or roman-numeral lists, but it is not consulted here. All list paragraphs are rendered as - item, silently discarding numbering for ordered lists.
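For reference, the lookup chain in WordprocessingML goes from the paragraph's `w:numId` and `w:ilvl` to a `w:num` entry, then to its `w:abstractNum`, then to the matching `w:lvl` and its `w:numFmt`. The sketch below walks that chain with the standard library on a parsed numbering root. The function names are hypothetical, `w:lvlOverride` and style-based numbering are ignored, and how the numbering XML is obtained from python-docx (I believe via `document.part.numbering_part`) is an assumption to verify.

```python
import xml.etree.ElementTree as ET
from typing import Optional

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w(attr: str) -> str:
    """Build a namespace-qualified attribute name in Clark notation."""
    return f"{{{W}}}{attr}"


def num_fmt(numbering: ET.Element, num_id: str, ilvl: str) -> Optional[str]:
    """Look up the w:numFmt value for a (numId, ilvl) pair, or None."""
    for num in numbering.findall("w:num", NS):
        if num.get(w("numId")) != num_id:
            continue
        ref = num.find("w:abstractNumId", NS)
        if ref is None:
            return None
        abstract_id = ref.get(w("val"))
        for abstract in numbering.findall("w:abstractNum", NS):
            if abstract.get(w("abstractNumId")) != abstract_id:
                continue
            for lvl in abstract.findall("w:lvl", NS):
                if lvl.get(w("ilvl")) == ilvl:
                    fmt = lvl.find("w:numFmt", NS)
                    return None if fmt is None else fmt.get(w("val"))
    return None


def is_ordered(fmt: Optional[str]) -> bool:
    """Treat anything other than a bullet (or unknown/none) as ordered."""
    return fmt not in (None, "bullet", "none")
```

With this in place, the list branch could emit `1. {text}` when `is_ordered(...)` is true and keep `- {text}` otherwise, though tracking the running counter per list is left out here.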

hamza-56 · 2026-03-17T04:23:18Z

app/rag/extraction.py

+    # Use <body> if present, otherwise the whole document
+    root = soup.body or soup
+
+    for el in root.children:


Only direct children of <body> are processed: This loop iterates root.children (immediate children only). Any content wrapped in <div>, <section>, <article>, or similar container elements -- which is extremely common in real-world HTML (especially Google Docs exports) -- will be silently dropped. This needs recursive traversal or flattening of container elements to avoid losing large portions of HTML documents.
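The recursive traversal suggested here could look roughly like the sketch below. It is written against any node that exposes `.name` and `.children` (as BeautifulSoup's `Tag` does), so it does not import bs4 itself. `iter_blocks` and the `CONTAINERS` set are hypothetical, and the set is illustrative rather than exhaustive.

```python
from typing import Any, Iterator

# Wrapper tags to look through rather than emit (an illustrative set).
CONTAINERS = {"div", "section", "article", "main", "header", "footer", "aside", "nav"}


def iter_blocks(node: Any) -> Iterator[Any]:
    """Yield block-level elements under node, descending into containers."""
    for child in getattr(node, "children", ()):
        name = getattr(child, "name", None)
        if not name:
            continue  # text nodes and comments carry no tag name
        if name in CONTAINERS:
            yield from iter_blocks(child)
        else:
            yield child
```

The existing loop would then become `for el in iter_blocks(root):`. Note that bare text sitting directly inside a container (with no `<p>` around it) is still skipped by this sketch and would need separate handling.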

hamza-56 · 2026-03-17T04:23:19Z

app/rag/extraction.py

+_DISPATCH = {
+    "pdf": _extract_pdf,
+    "docx": _extract_docx,
+    "doc": _extract_docx,


.doc is not supported by python-docx: This maps legacy .doc (pre-Word 2007 binary format) to the DOCX extractor, but python-docx only handles .docx (OOXML/ZIP). A real .doc file will raise BadZipFile at runtime with no clear error message. Either remove "doc" from the dispatch so the ValueError for unsupported types is raised, or add explicit error handling with a clear message.
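The second option (an explicit, clear error) might be sketched as follows. The `extract` entry point, the `_UNSUPPORTED_HINTS` table, and the stub `_extract_docx` are all hypothetical stand-ins, since the PR's real function signatures are not shown in this thread.

```python
from pathlib import Path


def _extract_docx(path: Path) -> str:
    """Placeholder standing in for the PR's real DOCX extractor."""
    return f"<docx:{path.name}>"


_DISPATCH = {"docx": _extract_docx}

# Extensions rejected with a targeted message instead of a generic one.
_UNSUPPORTED_HINTS = {
    "doc": "legacy .doc (binary Word) is not supported; convert it to .docx first",
}


def extract(path: Path) -> str:
    """Dispatch on the file extension, failing early with a clear message."""
    ext = path.suffix.lower().lstrip(".")
    if ext in _UNSUPPORTED_HINTS:
        raise ValueError(f"Cannot extract {path.name}: {_UNSUPPORTED_HINTS[ext]}")
    try:
        handler = _DISPATCH[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: .{ext}") from None
    return handler(path)
```

Rejecting `.doc` before the extractor runs means the caller sees an actionable message rather than a low-level archive error from inside the DOCX parser.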

hamza-56 · 2026-03-17T04:23:33Z

Conflicting docx and python-docx dependencies -- both claim the docx namespace; bs4 is a shim for beautifulsoup4

sparkth/pyproject.toml

Lines 44 to 48 in ecb057d

    "pymupdf4llm>=1.27.2.1",
    "docx>=0.2.4",
    "bs4>=0.0.2",
    "python-docx>=1.2.0",
]

Ordered lists lose numbering in HTML extraction -- <ol> items rendered as - item instead of 1. item

sparkth/app/rag/extraction.py

Lines 208 to 214 in ecb057d

elif name in ("ul", "ol"):
    for li in el.find_all("li", recursive=False):
        text = li.get_text(" ", strip=True)
        if text:
            lines.append(f"- {text}")
    lines.append("")

Ordered lists lose numbering in DOCX extraction -- w:numPr is present for both ordered and unordered lists but w:numFmt is not consulted

sparkth/app/rag/extraction.py

Lines 70 to 74 in ecb057d

# List items — python-docx exposes numPr when a paragraph is in a list
is_list = para._element.find(qn("w:numPr")) is not None
if is_list:
    return f"- {text}"

HTML extraction only processes direct children of <body> -- content inside <div>, <section>, <article> containers (common in real-world HTML) is silently dropped

sparkth/app/rag/extraction.py

Lines 195 to 199 in ecb057d

root = soup.body or soup
for el in root.children:
    if not isinstance(el, Tag) or not el.name:
        continue

.doc mapped to DOCX extractor but unsupported -- python-docx only handles .docx (OOXML); legacy .doc binary files will raise BadZipFile at runtime

sparkth/app/rag/extraction.py

Lines 245 to 249 in ecb057d

_DISPATCH = {
    "pdf": _extract_pdf,
    "docx": _extract_docx,
    "doc": _extract_docx,
    "html": _extract_html,

claude · 2026-03-17T10:25:24Z

Claude encountered an error —— View job

I'll analyze this and get back to you.

claude · 2026-03-18T07:16:25Z

Claude encountered an error —— View job

I'll analyze this and get back to you.

claude · 2026-03-18T07:48:53Z

Claude encountered an error —— View job

I'll analyze this and get back to you.

zaira-bibi self-assigned this Mar 12, 2026

feat(rag): implement document structure extraction layer

ecb057d

Supports PDF, DOCX, HTML and TXT/MD conversion to structured Markdown. Preserves heading hierarchy, lists, and tables across all formats.

zaira-bibi force-pushed the zaira/rag-chunking branch from 6f0ec08 to ecb057d Compare March 12, 2026 09:35

hamza-56 reviewed Mar 17, 2026

zaira-bibi force-pushed the zaira/rag-chunking branch from 1399cc0 to a9a51da Compare March 18, 2026 07:16

chore: addressed review comments

11abcdb

zaira-bibi force-pushed the zaira/rag-chunking branch from a9a51da to 11abcdb Compare March 18, 2026 07:48

zaira-bibi wants to merge 2 commits into main from zaira/rag-chunking


2 participants
