@HowardvanRooijen
Member

  • Modified WordDocumentReader to detect "Heading1" through "Heading6" styles and prepend Markdown-style headers (# to ######) to the text.
  • Added code comments explaining the heading detection logic.
  • Removed unused HeaderRegex from SemanticChunker.
  • Added WordDocumentReaderTests to verify that document structure is preserved.

Fixes #6

@github-actions

github-actions bot commented Jan 29, 2026

Test Results

394 tests +15 | 394 ✅ +15 | 6s ⏱️ -1s
  1 suites ±0 |   0 💤 ±0
  1 files  ±0 |   0 ❌ ±0

Results for commit 95b1a4b. ± Comparison against base commit 89f33c5.

♻️ This comment has been updated with latest results.

Copilot AI left a comment

Pull request overview

This pull request enhances the WordDocumentReader to preserve document structure by detecting heading styles (Heading1-Heading6) and converting them to Markdown-style headers, enabling the SemanticChunker to properly understand Word document hierarchies. The PR also removes an unused HeaderRegex from SemanticChunker and adds tests to verify the new functionality.

Changes:

  • Enhanced WordDocumentReader to detect and convert Word heading styles to Markdown headers
  • Removed unused HeaderRegex() method from SemanticChunker
  • Added comprehensive test coverage for the heading detection feature
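The mapping described above can be sketched as a small pure function. This is a minimal illustration only: the actual code in WordDocumentReader.cs is not shown in this conversation, so the `HeadingMarkdown` class, its method names, and the exact, case-sensitive match on the style ID are all assumptions.

```csharp
#nullable enable
using System;

// Hypothetical helper for illustration; not the code from WordDocumentReader.cs.
public static class HeadingMarkdown
{
    // Returns "# " through "###### " for style IDs "Heading1" through "Heading6",
    // or null for anything else (including null, "Heading0" and "Heading7").
    public static string? GetPrefix(string? styleId)
    {
        const string marker = "Heading";

        // Assumption: the style ID must be exactly "Heading" plus one digit,
        // compared case-sensitively.
        if (styleId is null ||
            styleId.Length != marker.Length + 1 ||
            !styleId.StartsWith(marker, StringComparison.Ordinal))
        {
            return null;
        }

        int level = styleId[^1] - '0';
        return level is >= 1 and <= 6 ? new string('#', level) + " " : null;
    }

    // Prepends the Markdown prefix when the paragraph has a heading style;
    // otherwise returns the paragraph text unchanged.
    public static string Apply(string? styleId, string text) =>
        GetPrefix(styleId) is { } prefix ? prefix + text : text;
}
```

With this sketch, `HeadingMarkdown.Apply("Heading2", "Scope")` yields `## Scope`, while a paragraph with no heading style passes through untouched.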

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| Solutions/Rlm.Cli/Core/Documents/WordDocumentReader.cs | Added heading style detection logic that converts Word heading styles (Heading1-Heading6) to Markdown headers (#-######) to preserve document structure |
| Solutions/Rlm.Cli/Core/Chunking/SemanticChunker.cs | Removed unused HeaderRegex() GeneratedRegex method that was never called in the codebase |
| Solutions/Rlm.Cli.Tests/Core/Documents/WordDocumentReaderTests.cs | Added new test class to verify that Word documents with heading styles are correctly converted to Markdown format |

Comment on lines 50 to 73
[TestMethod]
public async Task ReadAsync_DocumentWithHeadings_PreservesStructure()
{
    // Arrange
    CreateDocxWithHeadings(tempFilePath);
    Uri uri = new(tempFilePath);

    // Act
    RlmDocument? document = await reader.ReadAsync(uri, TestContext.CancellationToken);

    // Assert
    document.ShouldNotBeNull();

    // This assertion is expected to FAIL currently
    // We want to see:
    // # Heading Level 1
    // Normal text
    // ## Heading Level 2
    // More text

    string content = document.Content;
    content.ShouldContain("# Heading Level 1");
    content.ShouldContain("## Heading Level 2");
}

Copilot AI Jan 29, 2026

The test coverage for heading detection could be more comprehensive. Consider adding test cases for:

  1. Edge cases like Heading6 (the maximum supported level)
  2. Invalid cases like Heading7 or Heading0 (should not be converted to markdown headers)
  3. Empty paragraphs with heading styles
  4. Case variations (e.g., "heading1" vs "Heading1")
  5. Integration with normal paragraphs without heading styles

This would ensure the heading detection logic in WordDocumentReader.cs (lines 84-88) is properly validated for all scenarios.
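The scenarios listed above lend themselves to a table-driven check. The sketch below is framework-free so it runs on its own, and it uses a stand-in `Render` function rather than the real reader, whose code is not shown here. Two behaviours are assumptions, not facts about the PR: style IDs are matched case-sensitively, and a heading-styled paragraph with no text is left unchanged instead of emitting a bare `#`.

```csharp
#nullable enable
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-in plus a case table; names are illustrative only.
public static class HeadingEdgeCases
{
    // Assumption: exact, case-sensitive style IDs "Heading1" .. "Heading6".
    private static readonly Regex HeadingStyle =
        new(@"^Heading([1-6])\z", RegexOptions.CultureInvariant);

    public static string Render(string? styleId, string text)
    {
        Match match = HeadingStyle.Match(styleId ?? string.Empty);

        // Assumption: empty heading paragraphs are passed through unchanged.
        if (!match.Success || string.IsNullOrWhiteSpace(text))
        {
            return text;
        }

        int level = match.Groups[1].Value[0] - '0';
        return new string('#', level) + " " + text;
    }

    // One row per scenario from the review comment.
    public static readonly (string? StyleId, string Text, string Expected)[] Cases =
    {
        ("Heading6", "Deep", "###### Deep"), // maximum supported level
        ("Heading7", "Too deep", "Too deep"), // out of range, not converted
        ("Heading0", "Zero", "Zero"),         // out of range, not converted
        ("Heading1", "", ""),                 // empty paragraph with heading style
        ("heading1", "Lower", "Lower"),       // case variation, not converted here
        (null, "Body", "Body"),               // normal paragraph without a style
    };

    public static bool AllPass()
    {
        foreach (var (styleId, text, expected) in Cases)
        {
            if (Render(styleId, text) != expected)
            {
                return false;
            }
        }

        return true;
    }
}
```

In the project's own test suite, each row would more naturally become a data row on a parameterised test method; whether "heading1" and empty headings should convert is a design decision the rows make explicit.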

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.


Contributor

Copilot AI commented Jan 29, 2026

@HowardvanRooijen I've opened a new pull request, #8, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 2 commits January 29, 2026 12:38
* Initial plan

* Fix platform-dependent newline assertion in WordDocumentReaderTests

Co-authored-by: HowardvanRooijen <128664+HowardvanRooijen@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: HowardvanRooijen <128664+HowardvanRooijen@users.noreply.github.com>
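The commit message above mentions a platform-dependent newline assertion. The actual change is not shown in this conversation, so the following is only a sketch of one common approach: normalise line endings before asserting, so the same expected string holds whether the content uses `\r\n` or `\n`. The `NewlineNormalizer` helper is hypothetical.

```csharp
using System;

// Hypothetical helper; not taken from the commit referenced above.
public static class NewlineNormalizer
{
    // Converts Windows (\r\n) and bare \r line endings to \n.
    public static string ToLf(string text) =>
        text.Replace("\r\n", "\n").Replace('\r', '\n');
}
```

A test can then compare `NewlineNormalizer.ToLf(document.Content)` against an expected string written with `\n` only, instead of embedding `Environment.NewLine` or a literal `\r\n` in the assertion.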
@github-actions

Code Coverage Summary Report - Linux (No TFM)

Summary
Generated on: 01/30/2026 - 06:58:01
Parser: Cobertura
Assemblies: 1
Classes: 68
Files: 69
Line coverage: 65.2% (3941 of 6044)
Covered lines: 3941
Uncovered lines: 2103
Coverable lines: 6044
Total lines: 8674
Branch coverage: 55.6% (1408 of 2530)
Covered branches: 1408
Total branches: 2530
Method coverage: Feature is only available for sponsors

Coverage

rlm - 65.2%
Name Line Branch
rlm 65.2% 55.6%
Program 0% 0%
Rlm.Cli.Commands.AggregateCommand 92.5% 87.5%
Rlm.Cli.Commands.AggregateCommand.Settings 100%
Rlm.Cli.Commands.ChunkCommand 95.4% 77.1%
Rlm.Cli.Commands.ChunkCommand.Settings 100%
Rlm.Cli.Commands.ClearCommand 88% 78.5%
Rlm.Cli.Commands.ClearCommand.Settings 100%
Rlm.Cli.Commands.FilterCommand 100% 100%
Rlm.Cli.Commands.FilterCommand.Settings 100%
Rlm.Cli.Commands.ImportCommand 0% 0%
Rlm.Cli.Commands.ImportCommand.Settings 0%
Rlm.Cli.Commands.InfoCommand 42.3% 30.3%
Rlm.Cli.Commands.InfoCommand.Settings 100%
Rlm.Cli.Commands.JumpCommand 93.3% 76.9%
Rlm.Cli.Commands.JumpCommand.Settings 100%
Rlm.Cli.Commands.LoadCommand 44.2% 31.2%
Rlm.Cli.Commands.LoadCommand.Settings 66.6%
Rlm.Cli.Commands.NextCommand 64.7% 57.1%
Rlm.Cli.Commands.NextCommand.Settings 100%
Rlm.Cli.Commands.ResultsCommand 100% 100%
Rlm.Cli.Commands.SkipCommand 94.3% 84.3%
Rlm.Cli.Commands.SkipCommand.Settings 100%
Rlm.Cli.Commands.SliceCommand 89.7% 82.1%
Rlm.Cli.Commands.SliceCommand.Settings 100%
Rlm.Cli.Commands.StoreCommand 84.2% 83.3%
Rlm.Cli.Commands.StoreCommand.Settings 100%
Rlm.Cli.Core.Chunking.ChunkProcessorChain 100% 100%
Rlm.Cli.Core.Chunking.ChunkStatisticsProcessor 100% 100%
Rlm.Cli.Core.Chunking.ContentChunk 100%
Rlm.Cli.Core.Chunking.FilteringChunker 96.6% 93.7%
Rlm.Cli.Core.Chunking.FilteringChunker.Segment 100%
Rlm.Cli.Core.Chunking.RecursiveChunker 94.7% 87.5%
Rlm.Cli.Core.Chunking.RecursiveChunker.ChunkSegment 100%
Rlm.Cli.Core.Chunking.SemanticChunker 52.3% 37.5%
Rlm.Cli.Core.Chunking.SemanticChunker.Section 100%
Rlm.Cli.Core.Chunking.TokenBasedChunker 98.5% 87.5%
Rlm.Cli.Core.Chunking.UniformChunker 100% 83.3%
Rlm.Cli.Core.Documents.CompositeDocumentReader 53.8% 33.3%
Rlm.Cli.Core.Documents.ContentCleaningProcessor 100% 100%
Rlm.Cli.Core.Documents.DocumentMetadata 100%
Rlm.Cli.Core.Documents.DocumentProcessorChain 100% 100%
Rlm.Cli.Core.Documents.DocumentReaderExtensions 12.5% 5.8%
Rlm.Cli.Core.Documents.FileDocumentReader 48.5% 15.5%
Rlm.Cli.Core.Documents.HtmlDocumentReader 19.3% 18.1%
Rlm.Cli.Core.Documents.JsonDocumentReader 9.4% 12.5%
Rlm.Cli.Core.Documents.MarkdownDocumentReader 5.8% 8.3%
Rlm.Cli.Core.Documents.MetadataExtractionProcessor 100% 92%
Rlm.Cli.Core.Documents.PdfDocumentReader 9.4% 11.1%
Rlm.Cli.Core.Documents.RlmDocument 100%
Rlm.Cli.Core.Documents.StdinDocumentReader 4.3% 0%
Rlm.Cli.Core.Documents.WordDocumentReader 98.3% 82.5%
Rlm.Cli.Core.Output.AggregateOutput 100%
Rlm.Cli.Core.Output.ChunkOutput 100%
Rlm.Cli.Core.Output.SessionInfoOutput 0%
Rlm.Cli.Core.Session.ResultBuffer 100% 100%
Rlm.Cli.Core.Session.RlmSession 100% 100%
Rlm.Cli.Core.Validation.CompositeValidator 100% 100%
Rlm.Cli.Core.Validation.RangeValidator 100% 100%
Rlm.Cli.Core.Validation.SyntacticValidator 94.1% 94.4%
Rlm.Cli.Core.Validation.ValidationResult 100%
Rlm.Cli.Infrastructure.RlmCommandSettings 100%
Rlm.Cli.Infrastructure.RlmJsonContext 63% 65.8%
Rlm.Cli.Infrastructure.SessionStore 93.4% 81.8%
Rlm.Cli.Infrastructure.TypeRegistrar 100%
Rlm.Cli.Infrastructure.TypeResolver 100% 100%
System.Text.RegularExpressions.Generated 65.1% 47.8%
System.Text.RegularExpressions.Generated.RunnerFactory
System.Text.RegularExpressions.Generated.RunnerFactory.Runner

@HowardvanRooijen HowardvanRooijen merged commit 53f81b6 into main Jan 30, 2026
2 checks passed
@HowardvanRooijen HowardvanRooijen deleted the feature/issue-6 branch January 30, 2026 12:35

Development

Successfully merging this pull request may close these issues.

Could WordDocumentReader provide more semantic structure?
