Word document with merged table cells parsing error #93

@ronwsg

Description

When using docx2python to parse Word documents, tables with merged cells often trigger extraction errors.
Symptom: accessing the parsed document's `.text` fails with "IndexError: list index out of range". Full traceback:

`File ".../parsers.py", line 46, in extractTextFromDoc
docxText = docxContent.text
^^^^^^^^^^^^^^^^

File ".../.venv/lib/python3.13/site-packages/docx2python/docx_output.py", line 358, in text
return flatten_text(self.document_runs)
^^^^^^^^^^^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_output.py", line 280, in document_runs
+ self.body_runs
^^^^^^^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_output.py", line 254, in body_runs
return self.officeDocument_runs
^^^^^^^^^^^^^^^^^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_output.py", line 244, in officeDocument_runs
return get_par_strings(self.officeDocument_pars)
^^^^^^^^^^^^^^^^^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_output.py", line 180, in officeDocument_pars
return self._get_pars("officeDocument")
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_output.py", line 155, in _get_pars
content += file.content
^^^^^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_reader.py", line 265, in content
return self.get_content()
~~~~~~~~~~~~~~~~^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_reader.py", line 283, in get_content
return self.depth_collector.tree
^^^^^^^^^^^^^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_reader.py", line 256, in depth_collector
self.__depth_collector = self.__depth_collector or new_depth_collector(self)
~~~~~~~~~~~~~~~~~~~^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_text.py", line 392, in new_depth_collector
branches(root)
~~~~~~~~^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_text.py", line 388, in branches
branches(branch)
~~~~~~~~^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_text.py", line 388, in branches
branches(branch)
~~~~~~~~^^^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_text.py", line 388, in branches
branches(branch)
~~~~~~~~^^^^^^^^
[Previous line repeated 1 more time]
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_text.py", line 390, in branches
tag_runner.close(tree)
~~~~~~~~~~~~~~~~^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_text.py", line 160, in close
method(tree)
~~~~~~^^^^^^
File ".../.venv/lib/python3.13/site-packages/docx2python/docx_text.py", line 339, in _close_table_cell
this_tr[-1] = copy.deepcopy(prev_tr[tc_idx])
~~~~~~~^^^^^^^^
IndexError: list index out of range
`
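
The last frame of the traceback shows the current cell being filled from `prev_tr[tc_idx]`, so the error means the previous row holds fewer cells than the current cell's index. One plausible cause, which is an assumption and not confirmed here, is a horizontally merged cell in the row above, leaving that row with fewer cell entries. The sketch below reproduces that failure mode with plain lists and shows a bounds check that would avoid it. It is not docx2python's code; `copy_merged_cell`, `prev_row`, `this_row`, and `cell_idx` are hypothetical names used only for illustration.

```python
import copy


def copy_merged_cell(prev_row, this_row, cell_idx):
    """Fill the last cell of this_row from the cell above it, if one exists.

    Hypothetical helper: returns True when the cell was copied and False
    when the previous row has no cell at cell_idx.
    """
    if cell_idx >= len(prev_row):
        # The row above is shorter than expected; leave the cell untouched.
        return False
    this_row[-1] = copy.deepcopy(prev_row[cell_idx])
    return True


# Assumed table shape: the previous row has a single cell spanning two
# columns, while the current row has two cells, the second one being an
# empty continuation of a vertical merge.
prev_row = [["A (spans two columns)"]]
this_row = [["B"], []]

# Unguarded copy, mirroring the statement in the last traceback frame.
error_message = None
try:
    this_row[-1] = copy.deepcopy(prev_row[1])
except IndexError as exc:
    error_message = str(exc)

# Guarded copy: the short previous row is detected and nothing is raised.
copied = copy_merged_cell(prev_row, this_row, 1)
```

With the guard, the continuation cell simply stays empty when there is no cell above it to copy from. Whether that is the right fallback for the library is a design question for the maintainers.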
