feat(rag): implement document structure extraction layer #199
zaira-bibi wants to merge 2 commits into main
Supports PDF, DOCX, HTML and TXT/MD conversion to structured Markdown. Preserves heading hierarchy, lists, and tables across all formats.
6f0ec08 to ecb057d
    "pymupdf4llm>=1.27.2.1",
    "docx>=0.2.4",
    "bs4>=0.0.2",
    "python-docx>=1.2.0",
Conflicting dependency: docx>=0.2.4 (legacy 2014 stub) and python-docx>=1.2.0 both claim the docx namespace. The code imports from python-docx (from docx import Document). The legacy docx package can overwrite python-docx's module depending on install order, breaking imports at runtime. Remove the "docx>=0.2.4" line.
Similarly, "bs4>=0.0.2" is a PyPI shim that redirects to beautifulsoup4. Replace with "beautifulsoup4>=4.0" to depend on the actual library directly.
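A minimal sketch of how this kind of clash could be detected in an environment, assuming Python 3.10+ where `importlib.metadata.packages_distributions()` is available. The helper names `find_conflicts` and `installed_conflicts` are hypothetical and not part of this PR.

```python
def find_conflicts(mapping):
    """Return top-level module names provided by more than one distribution.

    `mapping` has the shape returned by importlib.metadata.packages_distributions():
    {top_level_name: [distribution_name, ...]}.
    """
    return {
        name: sorted(set(dists))
        for name, dists in mapping.items()
        if len(set(dists)) > 1
    }


def installed_conflicts():
    """Check the current environment (requires Python 3.10+)."""
    from importlib.metadata import packages_distributions

    return find_conflicts(packages_distributions())


# Hypothetical environment where both packages from the diff are installed.
example = {"docx": ["docx", "python-docx"], "bs4": ["beautifulsoup4"]}
print(find_conflicts(example))
```

With both `docx` and `python-docx` installed, the `docx` module name would be reported as provided by two distributions, which is the situation the comment above warns about.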
        lines.append(f"{_HTML_HEADING_MAP[name]} {text}")
        lines.append("")

    elif name in ("ul", "ol"):
Ordered lists silently lose numbering: Both <ul> and <ol> are rendered as - item. For <ol>, this should produce 1. item, 2. item, etc. to preserve ordering semantics. The PR description claims to "preserve lists" but ordered list structure is discarded here.
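One possible shape for the fix, as a sketch only: pick the marker from the tag name instead of hard-coding `-`. The function `render_list` and its `start` parameter are hypothetical. The PR's real code works on BeautifulSoup elements, whereas this version takes already-extracted item texts so it stays self-contained.

```python
def render_list(tag, items, start=1):
    """Render list item texts as Markdown lines.

    tag   -- "ul" or "ol" (the HTML list element name)
    items -- item texts, already stripped
    start -- first number for an ordered list (mirrors the HTML `start` attribute)
    """
    lines = []
    for number, text in enumerate(items, start=start):
        # Ordered lists keep their numbering; unordered lists use a dash.
        marker = f"{number}." if tag == "ol" else "-"
        lines.append(f"{marker} {text}")
    return lines


print(render_list("ol", ["first", "second"]))
print(render_list("ul", ["first", "second"]))
```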
app/rag/extraction.py (outdated)
    # List items - python-docx exposes numPr when a paragraph is in a list
    is_list = para._element.find(qn("w:numPr")) is not None
    if is_list:
        return f"- {text}"
Same issue for DOCX lists: w:numPr is present for both ordered and unordered lists. What distinguishes bullet from decimal or roman-numeral lists is the w:numFmt value of the matching level in the numbering definitions (resolved through the w:numId and w:ilvl children of w:numPr), and it is not consulted here. All list paragraphs are rendered as - item, silently discarding numbering for ordered lists.
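To illustrate the lookup, here is a rough sketch using only the standard library. In OOXML the w:numFmt value is stored in the numbering part and is reached from a paragraph's w:numId and w:ilvl. `resolve_num_fmt`, `list_prefix`, and the sample XML are hypothetical, and the sketch ignores w:lvlOverride and style-based numbering.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _w(name):
    """Qualified name in the WordprocessingML namespace."""
    return f"{{{W_NS}}}{name}"


def resolve_num_fmt(numbering_xml, num_id, ilvl):
    """Return the w:numFmt value ("bullet", "decimal", ...) or None."""
    root = ET.fromstring(numbering_xml)
    abstract_id = None
    # Step 1: w:num maps a numId to an abstract numbering definition.
    for num in root.findall(_w("num")):
        if num.get(_w("numId")) == str(num_id):
            ref = num.find(_w("abstractNumId"))
            abstract_id = ref.get(_w("val")) if ref is not None else None
            break
    if abstract_id is None:
        return None
    # Step 2: the abstract definition holds one w:lvl per indent level.
    for abstract in root.findall(_w("abstractNum")):
        if abstract.get(_w("abstractNumId")) != abstract_id:
            continue
        for lvl in abstract.findall(_w("lvl")):
            if lvl.get(_w("ilvl")) == str(ilvl):
                fmt = lvl.find(_w("numFmt"))
                return fmt.get(_w("val")) if fmt is not None else None
    return None


def list_prefix(num_fmt, index):
    """Markdown marker: dash for bullets or unknown formats, number otherwise."""
    return "-" if num_fmt in (None, "bullet") else f"{index}."


# Made-up numbering part: numId 1 is a bullet list, numId 2 is decimal.
SAMPLE = f"""
<w:numbering xmlns:w="{W_NS}">
  <w:abstractNum w:abstractNumId="0">
    <w:lvl w:ilvl="0"><w:numFmt w:val="bullet"/></w:lvl>
  </w:abstractNum>
  <w:abstractNum w:abstractNumId="1">
    <w:lvl w:ilvl="0"><w:numFmt w:val="decimal"/></w:lvl>
  </w:abstractNum>
  <w:num w:numId="1"><w:abstractNumId w:val="0"/></w:num>
  <w:num w:numId="2"><w:abstractNumId w:val="1"/></w:num>
</w:numbering>
"""

print(list_prefix(resolve_num_fmt(SAMPLE, 2, 0), 1), "item")
```

The extractor would still need to track a running counter per list to produce 1., 2., 3. across consecutive paragraphs. That bookkeeping is left out here.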
app/rag/extraction.py (outdated)
    # Use <body> if present, otherwise the whole document
    root = soup.body or soup

    for el in root.children:
Only direct children of <body> are processed: This loop iterates root.children (immediate children only). Any content wrapped in <div>, <section>, <article>, or similar container elements will be silently dropped, and such wrappers are extremely common in real-world HTML (especially Google Docs exports). This needs recursive traversal or flattening of container elements to avoid losing large portions of HTML documents.
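As an illustration of depth-independent extraction, the sketch below swaps BeautifulSoup for the standard library's event-based `html.parser`, so it is not the PR's approach. Block elements are collected wherever they appear, regardless of how many containers wrap them. `BlockCollector`, `html_to_lines`, and the reduced tag map are hypothetical.

```python
from html.parser import HTMLParser

# Reduced, illustrative tag map: Markdown prefix per block element.
_BLOCK_TAGS = {"h1": "#", "h2": "##", "h3": "###", "p": ""}


class BlockCollector(HTMLParser):
    """Collect headings and paragraphs at any nesting depth."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._tag = None  # block element currently being read
        self._buf = []    # text fragments of that element

    def handle_starttag(self, tag, attrs):
        # Containers such as <div> or <section> are simply passed through.
        if tag in _BLOCK_TAGS and self._tag is None:
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Normalize whitespace and emit one Markdown line per block.
            text = " ".join("".join(self._buf).split())
            if text:
                self.lines.append(f"{_BLOCK_TAGS[tag]} {text}".strip())
            self._tag = None


def html_to_lines(html):
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return collector.lines


nested = "<body><div><section><h2>Title</h2><p>Body  text</p></section></div></body>"
print(html_to_lines(nested))
```

With BeautifulSoup, a comparable effect could come from recursing into container tags instead of iterating only `root.children`. Which variant fits best depends on how lists and tables are handled elsewhere in the extractor.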
app/rag/extraction.py (outdated)
    _DISPATCH = {
        "pdf": _extract_pdf,
        "docx": _extract_docx,
        "doc": _extract_docx,
.doc is not supported by python-docx: This maps legacy .doc (pre-Word 2007 binary format) to the DOCX extractor, but python-docx only handles .docx (OOXML/ZIP). A real .doc file will fail at runtime with a low-level error such as BadZipFile and no clear message. Either remove "doc" from the dispatch so the ValueError for unsupported types is raised, or add explicit error handling with a clear message.
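The second option could look roughly like this. It is a sketch with stub extractors standing in for the PR's `_extract_pdf` and `_extract_docx`. The `extract` entry point and the exact error messages are assumptions.

```python
from pathlib import Path


def _extract_pdf(path):
    return f"pdf:{path}"  # stub standing in for the real extractor


def _extract_docx(path):
    return f"docx:{path}"  # stub standing in for the real extractor


# "doc" is deliberately absent: python-docx cannot read the legacy binary format.
_DISPATCH = {
    "pdf": _extract_pdf,
    "docx": _extract_docx,
}


def extract(path):
    """Dispatch on file extension, failing early with a readable message."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext == "doc":
        raise ValueError(
            "Legacy .doc files are not supported; convert the file to .docx first."
        )
    try:
        handler = _DISPATCH[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext!r}") from None
    return handler(path)


print(extract("report.docx"))
```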
Lines 44 to 48 in ecb057d
Lines 208 to 214 in ecb057d
Lines 70 to 74 in ecb057d
Lines 195 to 199 in ecb057d
Lines 245 to 249 in ecb057d
1399cc0 to a9a51da
a9a51da to 11abcdb
Closes #197