Description
When using docx2python to parse Word documents, tables with merged cells often trigger extraction errors.
Symptom: the parser fails with "IndexError: list index out of range":
`File ".../parsers.py", line 46, in extractTextFromDoc
docxText = docxContent.text
^^^^^^^^^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_output.py", line 358, in text
return flatten_text(self.document_runs)
^^^^^^^^^^^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_output.py", line 280, in document_runs
+ self.body_runs
^^^^^^^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_output.py", line 254, in body_runs
return self.officeDocument_runs
^^^^^^^^^^^^^^^^^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_output.py", line 244, in officeDocument_runs
return get_par_strings(self.officeDocument_pars)
^^^^^^^^^^^^^^^^^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_output.py", line 180, in officeDocument_pars
return self._get_pars("officeDocument")
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_output.py", line 155, in _get_pars
content += file.content
^^^^^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_reader.py", line 265, in content
return self.get_content()
~~~~~~~~~~~~~~~~^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_reader.py", line 283, in get_content
return self.depth_collector.tree
^^^^^^^^^^^^^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_reader.py", line 256, in depth_collector
self.__depth_collector = self.__depth_collector or new_depth_collector(self)
~~~~~~~~~~~~~~~~~~~^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_text.py", line 392, in new_depth_collector
branches(root)
~~~~~~~~^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_text.py", line 388, in branches
branches(branch)
~~~~~~~~^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_text.py", line 388, in branches
branches(branch)
~~~~~~~~^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_text.py", line 388, in branches
branches(branch)
~~~~~~~~^^^^^^^^
[Previous line repeated 1 more time]
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_text.py", line 390, in branches
tag_runner.close(tree)
~~~~~~~~~~~~~~~~^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_text.py", line 160, in close
method(tree)
~~~~~~^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_text.py", line 339, in _close_table_cell
this_tr[-1] = copy.deepcopy(prev_tr[tc_idx])
~~~~~~~^^^^^^^^
IndexError: list index out of range
`
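
The last frame fails on `this_tr[-1] = copy.deepcopy(prev_tr[tc_idx])`, which raises when `tc_idx` is not a valid index into `prev_tr`. The snippet below is a simplified model of that statement, not docx2python's actual code: the helper `copy_merged_cell` and the sample rows are hypothetical, and the idea that the previous row ends up shorter because of a merged cell is an assumption rather than something the traceback confirms. It only illustrates how the error type can arise and what a bounds check would look like.

```python
import copy


def copy_merged_cell(prev_tr, this_tr, tc_idx):
    """Fill the last cell of this_tr from the cell above it, if one exists.

    Hypothetical helper modelled on the failing statement in the traceback;
    it is not part of docx2python.
    """
    if tc_idx < len(prev_tr):
        this_tr[-1] = copy.deepcopy(prev_tr[tc_idx])
        return True
    # The previous row has no cell at tc_idx, so there is nothing to copy.
    # Leave the current cell unchanged instead of raising IndexError.
    return False


# Assumed shape: the previous row has 2 cells, the current row has 3.
prev_row = [["A"], ["B"]]
this_row = [["x"], ["y"], [""]]

# The unguarded access raises the same error type as in the report.
error_message = None
try:
    this_row[-1] = copy.deepcopy(prev_row[2])
except IndexError as exc:
    error_message = str(exc)

# The guarded version skips the copy when the index is out of range.
copied = copy_merged_cell(prev_row, this_row, 2)
```

Whether skipping the copy is the right behaviour for docx2python is a separate question; the sketch only shows that the failing index is not checked against the length of the previous row.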