Fix: Issue #6 #7

HowardvanRooijen · 2026-01-29T11:45:15Z

Modified WordDocumentReader to detect "Heading1" through "Heading6" styles and prepend Markdown-style headers (# to ######) to the text.
Added code comments explaining the heading detection logic.
Removed unused HeaderRegex from SemanticChunker.
Added WordDocumentReaderTests to verify that document structure is preserved.

Fixes #6
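
The heading conversion described above can be sketched roughly as follows. This is a minimal illustration, not the actual `WordDocumentReader` code: the `HeadingMarkdown` class, its `PrefixFor` and `Render` methods, and the choice of case-sensitive matching are all assumptions made for the example, and paragraphs are modelled as plain (style id, text) pairs rather than Open XML elements.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; names are not taken from the PR.
public static class HeadingMarkdown
{
    // Returns "#" to "######" for "Heading1" to "Heading6", or null for any other style id.
    // Assumption: matching is case-sensitive.
    public static string? PrefixFor(string? styleId)
    {
        const string marker = "Heading";
        if (styleId is null
            || styleId.Length != marker.Length + 1
            || !styleId.StartsWith(marker, StringComparison.Ordinal))
        {
            return null;
        }

        // The single trailing character must be a digit between 1 and 6.
        int level = styleId[^1] - '0';
        return level is >= 1 and <= 6 ? new string('#', level) : null;
    }

    // Joins paragraphs into text, prepending a Markdown header marker to heading paragraphs.
    public static string Render(IEnumerable<(string? StyleId, string Text)> paragraphs)
    {
        var sb = new StringBuilder();
        foreach (var (styleId, text) in paragraphs)
        {
            string? prefix = PrefixFor(styleId);
            sb.Append(prefix is null ? text : $"{prefix} {text}");
            sb.Append('\n');
        }

        return sb.ToString();
    }
}
```

With this sketch, a paragraph styled `Heading2` with the text `Heading Level 2` would be rendered as `## Heading Level 2`, while an unstyled paragraph passes through unchanged.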

github-actions · 2026-01-29T11:47:36Z

Test Results

394 tests +15	394 ✅ +15	6s ⏱️ -1s
1 suites ±0	0 💤 ±0
1 files ±0	0 ❌ ±0

Results for commit 95b1a4b. ± Comparison against base commit 89f33c5.

♻️ This comment has been updated with latest results.

Copilot

Pull request overview

This pull request enhances the WordDocumentReader to preserve document structure by detecting heading styles (Heading1-Heading6) and converting them to Markdown-style headers, enabling the SemanticChunker to properly understand Word document hierarchies. The PR also removes an unused HeaderRegex from SemanticChunker and adds tests to verify the new functionality.

Changes:

Enhanced WordDocumentReader to detect and convert Word heading styles to Markdown headers
Removed unused HeaderRegex() method from SemanticChunker
Added comprehensive test coverage for the heading detection feature

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
Solutions/Rlm.Cli/Core/Documents/WordDocumentReader.cs	Added heading style detection logic that converts Word heading styles (Heading1-Heading6) to Markdown headers (#-######) to preserve document structure
Solutions/Rlm.Cli/Core/Chunking/SemanticChunker.cs	Removed unused `HeaderRegex()` GeneratedRegex method that was never called in the codebase
Solutions/Rlm.Cli.Tests/Core/Documents/WordDocumentReaderTests.cs	Added new test class to verify that Word documents with heading styles are correctly converted to Markdown format

Copilot · 2026-01-29T11:50:36Z

Solutions/Rlm.Cli.Tests/Core/Documents/WordDocumentReaderTests.cs

+    [TestMethod]
+    public async Task ReadAsync_DocumentWithHeadings_PreservesStructure()
+    {
+        // Arrange
+        CreateDocxWithHeadings(tempFilePath);
+        Uri uri = new(tempFilePath);
+
+        // Act
+        RlmDocument? document = await reader.ReadAsync(uri, TestContext.CancellationToken);
+
+        // Assert
+        document.ShouldNotBeNull();
+
+        // This assertion is expected to FAIL currently
+        // We want to see:
+        // # Heading Level 1
+        // Normal text
+        // ## Heading Level 2
+        // More text
+
+        string content = document.Content;
+        content.ShouldContain("# Heading Level 1");
+        content.ShouldContain("## Heading Level 2");
+    }


The test coverage for heading detection could be more comprehensive. Consider adding test cases for:

Edge cases like Heading6 (the maximum supported level)
Invalid cases like Heading7 or Heading0 (should not be converted to markdown headers)
Empty paragraphs with heading styles
Case variations (e.g., "heading1" vs "Heading1")
Integration with normal paragraphs without heading styles

This would ensure the heading detection logic in WordDocumentReader.cs (lines 84-88) is properly validated for all scenarios.
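
Several of the suggested cases could be captured in a small data-driven table like the one below. This is only a sketch under stated assumptions: `HeadingEdgeCases`, `LevelOf`, and `Cases` are hypothetical names, the detection rule is a stand-in regex rather than the reader's real logic, and the expected result for `"heading1"` assumes case-sensitive matching, which the source does not confirm. Empty paragraphs with heading styles are not covered here.

```csharp
#nullable enable
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the heading detection rule; not the PR's implementation.
public static class HeadingEdgeCases
{
    // Assumed rule: exactly "Heading1" to "Heading6", case-sensitive, nothing before or after.
    private static readonly Regex HeadingStyle =
        new(@"\AHeading([1-6])\z", RegexOptions.CultureInvariant);

    // Returns the heading level (1 to 6), or 0 when the style is not a supported heading.
    public static int LevelOf(string? styleId)
    {
        if (styleId is null)
        {
            return 0;
        }

        Match match = HeadingStyle.Match(styleId);
        return match.Success ? match.Groups[1].Value[0] - '0' : 0;
    }

    // Style id paired with the level expected under the assumed rule.
    public static readonly (string? StyleId, int ExpectedLevel)[] Cases =
    {
        ("Heading1", 1),
        ("Heading6", 6),   // maximum supported level
        ("Heading7", 0),   // above the supported range
        ("Heading0", 0),   // below the supported range
        ("Heading10", 0),  // two digits must not be read as level 1
        ("heading1", 0),   // assumption: case variations are not treated as headings
        ("Normal", 0),     // ordinary paragraph style
        (null, 0),         // paragraph without a style
    };

    // True when every case in the table matches the assumed rule.
    public static bool AllCasesPass() =>
        Cases.All(c => LevelOf(c.StyleId) == c.ExpectedLevel);
}
```

In an MSTest project, each row could become a `[DataRow]` on a single parameterised test, so that a change in the case-sensitivity decision only touches one line of the table.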

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Solutions/Rlm.Cli.Tests/Core/Documents/WordDocumentReaderTests.cs

Copilot · 2026-01-29T12:29:10Z

@HowardvanRooijen I've opened a new pull request, #8, to work on those changes. Once the pull request is ready, I'll request review from you.

* Initial plan
* Fix platform-dependent newline assertion in WordDocumentReaderTests

Co-authored-by: HowardvanRooijen <128664+HowardvanRooijen@users.noreply.github.com>

--------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: HowardvanRooijen <128664+HowardvanRooijen@users.noreply.github.com>
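
One common way to make such an assertion independent of the platform's newline is to normalise line endings before comparing. The sketch below illustrates that idea only; it is not taken from pull request #8, the `NewlineNormalization` class is a hypothetical name, and it assumes .NET 6 or later, where `string.ReplaceLineEndings` is available.

```csharp
using System;

// Hypothetical helper for illustration; not the change made in pull request #8.
public static class NewlineNormalization
{
    // Rewrites every newline sequence (for example "\r\n" or "\r") as "\n".
    public static string Normalize(string text) => text.ReplaceLineEndings("\n");

    // Compares two strings while ignoring differences in line endings.
    public static bool EqualIgnoringLineEndings(string left, string right) =>
        string.Equals(Normalize(left), Normalize(right), StringComparison.Ordinal);
}
```

A test can then compare `Normalize(document.Content)` against an expected string written with `\n`, instead of embedding `Environment.NewLine` or a literal `\r\n` in the expectation.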

github-actions · 2026-01-30T06:58:11Z

Code Coverage Summary Report - Linux (No TFM)

Summary


Generated on:	01/30/2026 - 06:58:01
Parser:	Cobertura
Assemblies:	1
Classes:	68
Files:	69
Line coverage:	65.2% (3941 of 6044)
Covered lines:	3941
Uncovered lines:	2103
Coverable lines:	6044
Total lines:	8674
Branch coverage:	55.6% (1408 of 2530)
Covered branches:	1408
Total branches:	2530
Method coverage:	Feature is only available for sponsors

Coverage

rlm - 65.2%

Name	Line	Branch
rlm	65.2%	55.6%
Program	0%	0%
Rlm.Cli.Commands.AggregateCommand	92.5%	87.5%
Rlm.Cli.Commands.AggregateCommand.Settings	100%
Rlm.Cli.Commands.ChunkCommand	95.4%	77.1%
Rlm.Cli.Commands.ChunkCommand.Settings	100%
Rlm.Cli.Commands.ClearCommand	88%	78.5%
Rlm.Cli.Commands.ClearCommand.Settings	100%
Rlm.Cli.Commands.FilterCommand	100%	100%
Rlm.Cli.Commands.FilterCommand.Settings	100%
Rlm.Cli.Commands.ImportCommand	0%	0%
Rlm.Cli.Commands.ImportCommand.Settings	0%
Rlm.Cli.Commands.InfoCommand	42.3%	30.3%
Rlm.Cli.Commands.InfoCommand.Settings	100%
Rlm.Cli.Commands.JumpCommand	93.3%	76.9%
Rlm.Cli.Commands.JumpCommand.Settings	100%
Rlm.Cli.Commands.LoadCommand	44.2%	31.2%
Rlm.Cli.Commands.LoadCommand.Settings	66.6%
Rlm.Cli.Commands.NextCommand	64.7%	57.1%
Rlm.Cli.Commands.NextCommand.Settings	100%
Rlm.Cli.Commands.ResultsCommand	100%	100%
Rlm.Cli.Commands.SkipCommand	94.3%	84.3%
Rlm.Cli.Commands.SkipCommand.Settings	100%
Rlm.Cli.Commands.SliceCommand	89.7%	82.1%
Rlm.Cli.Commands.SliceCommand.Settings	100%
Rlm.Cli.Commands.StoreCommand	84.2%	83.3%
Rlm.Cli.Commands.StoreCommand.Settings	100%
Rlm.Cli.Core.Chunking.ChunkProcessorChain	100%	100%
Rlm.Cli.Core.Chunking.ChunkStatisticsProcessor	100%	100%
Rlm.Cli.Core.Chunking.ContentChunk	100%
Rlm.Cli.Core.Chunking.FilteringChunker	96.6%	93.7%
Rlm.Cli.Core.Chunking.FilteringChunker.Segment	100%
Rlm.Cli.Core.Chunking.RecursiveChunker	94.7%	87.5%
Rlm.Cli.Core.Chunking.RecursiveChunker.ChunkSegment	100%
Rlm.Cli.Core.Chunking.SemanticChunker	52.3%	37.5%
Rlm.Cli.Core.Chunking.SemanticChunker.Section	100%
Rlm.Cli.Core.Chunking.TokenBasedChunker	98.5%	87.5%
Rlm.Cli.Core.Chunking.UniformChunker	100%	83.3%
Rlm.Cli.Core.Documents.CompositeDocumentReader	53.8%	33.3%
Rlm.Cli.Core.Documents.ContentCleaningProcessor	100%	100%
Rlm.Cli.Core.Documents.DocumentMetadata	100%
Rlm.Cli.Core.Documents.DocumentProcessorChain	100%	100%
Rlm.Cli.Core.Documents.DocumentReaderExtensions	12.5%	5.8%
Rlm.Cli.Core.Documents.FileDocumentReader	48.5%	15.5%
Rlm.Cli.Core.Documents.HtmlDocumentReader	19.3%	18.1%
Rlm.Cli.Core.Documents.JsonDocumentReader	9.4%	12.5%
Rlm.Cli.Core.Documents.MarkdownDocumentReader	5.8%	8.3%
Rlm.Cli.Core.Documents.MetadataExtractionProcessor	100%	92%
Rlm.Cli.Core.Documents.PdfDocumentReader	9.4%	11.1%
Rlm.Cli.Core.Documents.RlmDocument	100%
Rlm.Cli.Core.Documents.StdinDocumentReader	4.3%	0%
Rlm.Cli.Core.Documents.WordDocumentReader	98.3%	82.5%
Rlm.Cli.Core.Output.AggregateOutput	100%
Rlm.Cli.Core.Output.ChunkOutput	100%
Rlm.Cli.Core.Output.SessionInfoOutput	0%
Rlm.Cli.Core.Session.ResultBuffer	100%	100%
Rlm.Cli.Core.Session.RlmSession	100%	100%
Rlm.Cli.Core.Validation.CompositeValidator	100%	100%
Rlm.Cli.Core.Validation.RangeValidator	100%	100%
Rlm.Cli.Core.Validation.SyntacticValidator	94.1%	94.4%
Rlm.Cli.Core.Validation.ValidationResult	100%
Rlm.Cli.Infrastructure.RlmCommandSettings	100%
Rlm.Cli.Infrastructure.RlmJsonContext	63%	65.8%
Rlm.Cli.Infrastructure.SessionStore	93.4%	81.8%
Rlm.Cli.Infrastructure.TypeRegistrar	100%
Rlm.Cli.Infrastructure.TypeResolver	100%	100%
System.Text.RegularExpressions.Generated	65.1%	47.8%
System.Text.RegularExpressions.Generated.RunnerFactory
System.Text.RegularExpressions.Generated.RunnerFactory.Runner

Copilot AI review requested due to automatic review settings January 29, 2026 11:45

Copilot started reviewing on behalf of HowardvanRooijen January 29, 2026 11:45

github-actions bot added the pending_release label Jan 29, 2026

HowardvanRooijen mentioned this pull request Jan 29, 2026

Could WordDocumentReader provide more semantic structure? #6

Closed

Copilot AI reviewed Jan 29, 2026

Add tests for WordDocumentReader to verify heading preservation in co…

e678b50

…mplex documents

HowardvanRooijen requested a review from Copilot January 29, 2026 12:09

Copilot started reviewing on behalf of HowardvanRooijen January 29, 2026 12:09

Copilot AI reviewed Jan 29, 2026

Solutions/Rlm.Cli.Tests/Core/Documents/WordDocumentReaderTests.cs (outdated, resolved)

Copilot AI mentioned this pull request Jan 29, 2026

Fix platform-dependent newline in WordDocumentReader test assertion #8

Merged

Copilot AI and others added 2 commits January 29, 2026 12:38

Add comprehensive tests for WordDocumentReader functionality and meta…

95b1a4b

…data extraction

HowardvanRooijen merged commit 53f81b6 into main Jan 30, 2026
2 checks passed

HowardvanRooijen deleted the feature/issue-6 branch January 30, 2026 12:35

endjin-bot bot removed the pending_release label Jan 30, 2026