-
Notifications
You must be signed in to change notification settings - Fork 0
Fix: Issue #6 #7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Modified `WordDocumentReader` to detect "Heading1" through "Heading6" styles and prepend Markdown-style headers (`#` to `######`) to the text. - Added code comments explaining the heading detection logic. - Removed unused `HeaderRegex` from `SemanticChunker`. - Added `WordDocumentReaderTests` to verify that document structure is preserved. Fixes #6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request enhances the WordDocumentReader to preserve document structure by detecting heading styles (Heading1-Heading6) and converting them to Markdown-style headers, enabling the SemanticChunker to properly understand Word document hierarchies. The PR also removes an unused HeaderRegex from SemanticChunker and adds tests to verify the new functionality.
Changes:
- Enhanced
WordDocumentReaderto detect and convert Word heading styles to Markdown headers - Removed unused
HeaderRegex()method fromSemanticChunker - Added comprehensive test coverage for the heading detection feature
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| Solutions/Rlm.Cli/Core/Documents/WordDocumentReader.cs | Added heading style detection logic that converts Word heading styles (Heading1-Heading6) to Markdown headers (#-######) to preserve document structure |
| Solutions/Rlm.Cli/Core/Chunking/SemanticChunker.cs | Removed unused HeaderRegex() GeneratedRegex method that was never called in the codebase |
| Solutions/Rlm.Cli.Tests/Core/Documents/WordDocumentReaderTests.cs | Added new test class to verify that Word documents with heading styles are correctly converted to Markdown format |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Solutions/Rlm.Cli.Tests/Core/Documents/WordDocumentReaderTests.cs
Outdated
Show resolved
Hide resolved
Solutions/Rlm.Cli.Tests/Core/Documents/WordDocumentReaderTests.cs
Outdated
Show resolved
Hide resolved
Solutions/Rlm.Cli.Tests/Core/Documents/WordDocumentReaderTests.cs
Outdated
Show resolved
Hide resolved
| [TestMethod] | ||
| public async Task ReadAsync_DocumentWithHeadings_PreservesStructure() | ||
| { | ||
| // Arrange | ||
| CreateDocxWithHeadings(tempFilePath); | ||
| Uri uri = new(tempFilePath); | ||
|
|
||
| // Act | ||
| RlmDocument? document = await reader.ReadAsync(uri, TestContext.CancellationToken); | ||
|
|
||
| // Assert | ||
| document.ShouldNotBeNull(); | ||
|
|
||
| // This assertion is expected to FAIL currently | ||
| // We want to see: | ||
| // # Heading Level 1 | ||
| // Normal text | ||
| // ## Heading Level 2 | ||
| // More text | ||
|
|
||
| string content = document.Content; | ||
| content.ShouldContain("# Heading Level 1"); | ||
| content.ShouldContain("## Heading Level 2"); | ||
| } |
Copilot
AI
Jan 29, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test coverage for heading detection could be more comprehensive. Consider adding test cases for:
- Edge cases like Heading6 (the maximum supported level)
- Invalid cases like Heading7 or Heading0 (should not be converted to markdown headers)
- Empty paragraphs with heading styles
- Case variations (e.g., "heading1" vs "Heading1")
- Integration with normal paragraphs without heading styles
This would ensure the heading detection logic in WordDocumentReader.cs (lines 84-88) is properly validated for all scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Solutions/Rlm.Cli.Tests/Core/Documents/WordDocumentReaderTests.cs
Outdated
Show resolved
Hide resolved
|
@HowardvanRooijen I've opened a new pull request, #8, to work on those changes. Once the pull request is ready, I'll request review from you. |
* Initial plan * Fix platform-dependent newline assertion in WordDocumentReaderTests Co-authored-by: HowardvanRooijen <128664+HowardvanRooijen@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: HowardvanRooijen <128664+HowardvanRooijen@users.noreply.github.com>
Code Coverage Summary Report - Linux (No TFM)Summary
Coveragerlm - 65.2%
|
WordDocumentReaderto detect "Heading1" through "Heading6" styles and prepend Markdown-style headers (#to######) to the text.HeaderRegexfromSemanticChunker.WordDocumentReaderTeststo verify that document structure is preserved.Fixes #6