Fix clipping when malformed <figure> nests article content #151
shuuul wants to merge 2 commits into kepano:main
Pull request overview
Fixes a regression where extraction/markdown conversion was clipped when malformed/unclosed <figure> tags caused subsequent article content to be nested inside a figure.
Changes:
- Skip figure standardization when a <figure> contains apparent non-image block content (malformed HTML signal).
- Update markdown figure conversion to preserve any remaining non-image content after emitting the image + caption.
- Add Lilian Weng “Why We Think” regression fixture + expected markdown snapshot, plus a focused regression test.
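The first change in the list above can be sketched roughly as follows. This is not the code from the PR: it assumes a minimal `ElementLike` stand-in for DOM elements so it runs without a browser, and the list of block tags is a guess at what might count as a malformed-HTML signal. Only the name `hasNonImageContent` comes from the review summary.

```typescript
// Minimal stand-in for a DOM element (assumption, so the sketch runs without a DOM).
interface ElementLike {
  tagName: string; // lowercase tag name, e.g. "figure"
  children: ElementLike[];
}

// Hypothetical list of block tags that a plain image figure should not contain.
const SUSPICIOUS_BLOCK_TAGS: string[] = [
  "p", "h1", "h2", "h3", "h4", "h5", "h6",
  "ul", "ol", "table", "pre", "blockquote", "section",
];

// True if this node or any of its descendants is one of the suspicious block tags.
function containsBlockTag(node: ElementLike): boolean {
  if (SUSPICIOUS_BLOCK_TAGS.indexOf(node.tagName) !== -1) return true;
  return node.children.some((child) => containsBlockTag(child));
}

// Sketch of a malformed-figure detector: the caption is skipped because it may
// legitimately contain paragraphs; everything else is scanned for block content.
function hasNonImageContent(figure: ElementLike): boolean {
  return figure.children
    .filter((child) => child.tagName !== "figcaption")
    .some((child) => containsBlockTag(child));
}

// Small helper for building test trees.
const el = (tagName: string, ...children: ElementLike[]): ElementLike => ({
  tagName,
  children,
});

// A well-formed figure: image plus caption.
const wellFormedFigure = el("figure", el("img"), el("figcaption", el("p")));
// An unclosed <figure> that swallowed a heading and a paragraph.
const malformedFigure = el("figure", el("img"), el("figcaption"), el("h2"), el("p"));

console.log(hasNonImageContent(wellFormedFigure)); // false
console.log(hasNonImageContent(malformedFigure)); // true
```

When the detector returns true, standardization would be skipped and the figure left as-is, so the nested article content is not collapsed into a single image figure.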
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/markdown.ts | Extends the Turndown <figure> rule to emit image+caption and then convert/preserve any remaining nested content. |
| src/elements/images.ts | Adds a malformed-figure detector (hasNonImageContent) to bypass standardization when a figure contains apparent non-image content. |
| tests/lilian-thinking.test.ts | Adds a focused regression test ensuring extraction reaches Citation/References and meets a minimum word count. |
| tests/fixtures/general--lilianweng.github.io:posts:2025-05-01-thinking:.html | Adds the regression HTML fixture exhibiting malformed/unclosed <figure> nesting behavior. |
| tests/expected/general--lilianweng.github.io:posts:2025-05-01-thinking:.md | Adds the expected extracted markdown snapshot for the new fixture. |
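The kind of check the regression test is described as making (late sections are present, and the output meets a minimum word count) could look roughly like this. The helper name `findClippingProblems`, the heading-matching rule, and the sample strings are all assumptions for illustration; the actual test in tests/lilian-thinking.test.ts may be structured differently.

```typescript
// Hypothetical helper: returns a list of problems, empty when the markdown
// looks complete. Not taken from the PR.
function findClippingProblems(
  markdown: string,
  requiredHeadings: string[],
  minWords: number,
): string[] {
  const problems: string[] = [];

  for (const heading of requiredHeadings) {
    // Escape regex metacharacters, then look for a markdown heading line
    // whose text is exactly the required heading.
    const escaped = heading.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    const pattern = new RegExp("^#{1,6}\\s+" + escaped + "\\s*$", "m");
    if (!pattern.test(markdown)) {
      problems.push("missing heading: " + heading);
    }
  }

  // Count whitespace-separated tokens as words.
  const wordCount = markdown.split(/\s+/).filter((w) => w.length > 0).length;
  if (wordCount < minWords) {
    problems.push("only " + wordCount + " words, expected at least " + minWords);
  }

  return problems;
}

// Made-up sample outputs: one that reaches the final sections, one that stops early.
const completeOutput =
  "# Why We Think\n\nSome body text here.\n\n## Citation\n\nCite me.\n\n## References\n\n[1] A paper.\n";
const clippedOutput = "# Why We Think\n\nSome body text here.\n";

console.log(findClippingProblems(completeOutput, ["Citation", "References"], 10));
console.log(findClippingProblems(clippedOutput, ["Citation", "References"], 10));
```

Checking for sections near the end of the article catches clipping even if the word-count threshold is set too low.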
src/markdown.ts
Outdated
    const cloneImg = clone.querySelector('img');
    if (cloneImg) {
      const parentLink = cloneImg.closest('a');
      if (parentLink && parentLink !== clone) {
        parentLink.remove();
      } else {
        cloneImg.remove();
      }
In the clone cleanup, removing the closest parent <a> (parentLink.remove()) will drop any non-image content inside that link (e.g. text nodes or other elements) and defeats the goal of preserving non-image content from malformed figures. Consider removing just the <img> (or unwrapping the <a> while preserving its other children) instead of removing the entire anchor element.
Addressed in 4219bb2. The cleanup now removes only the cloned <img> instead of dropping the entire parent link, so linked text/content inside malformed figures is preserved. I also added a focused regression test in tests/markdown-figures.test.ts covering this case.
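The difference between removing the whole link and removing only the image can be shown with a small tree model. This is a sketch, not the code from commit 4219bb2: the `NodeLike` type and the `withoutImages` and `textContent` helpers are hypothetical, and the sketch strips every `<img>` rather than just the first one found.

```typescript
// Minimal stand-in for DOM nodes (assumption, so the sketch runs without a DOM).
type NodeLike =
  | { kind: "text"; text: string }
  | { kind: "element"; tagName: string; children: NodeLike[] };

// Returns a copy of the tree with <img> elements removed and everything else,
// including any wrapping <a> and its other children, kept in place.
function withoutImages(node: NodeLike): NodeLike | null {
  if (node.kind === "text") return node;
  if (node.tagName === "img") return null;
  const children: NodeLike[] = [];
  for (const child of node.children) {
    const kept = withoutImages(child);
    if (kept !== null) children.push(kept);
  }
  return { kind: "element", tagName: node.tagName, children };
}

// Concatenates all text in the subtree, similar in spirit to DOM textContent.
function textContent(node: NodeLike): string {
  if (node.kind === "text") return node.text;
  return node.children.map((child) => textContent(child)).join("");
}

// A figure whose link wraps both the image and some text.
const linkedFigure: NodeLike = {
  kind: "element",
  tagName: "figure",
  children: [
    {
      kind: "element",
      tagName: "a",
      children: [
        { kind: "element", tagName: "img", children: [] },
        { kind: "text", text: "Read the full post" },
      ],
    },
  ],
};

const cleanedFigure = withoutImages(linkedFigure);
console.log(cleanedFigure === null ? "" : textContent(cleanedFigure)); // "Read the full post"
```

Had the whole `<a>` been dropped, the linked text would have been lost along with the image, which is the problem the review comment points out.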
src/markdown.ts
Outdated
    let imageMarkdown = `\n\n${caption}\n\n`;

    // Preserve non-image content inside the figure (e.g. when an unclosed
    // <figure> tag causes subsequent content to be nested inside it)
    const clone = node.cloneNode(true) as Element;
This rule now always clones the entire <figure>, serializes it, and runs an extra Turndown pass to look for remaining content. For the common case of well-formed figures (only image+caption), this adds avoidable work during markdown conversion. Consider short-circuiting unless the figure appears malformed / contains extra non-image nodes (e.g. a quick structural check before cloning), or only running the extra conversion when leftover content is actually possible.
Addressed in 4219bb2. I added a hasResidualFigureContent guard so well-formed image+caption figures return early and skip the extra clone/serialization/Turndown pass. The malformed-figure regression still passes.
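A guard of this kind might be sketched as below. Only the name `hasResidualFigureContent` comes from the reply; the `FigureNode` model, the tag lists, and the `convertFigure` wrapper with its `convertResidual` callback are assumptions made so the example runs without a DOM or Turndown.

```typescript
// Minimal stand-in for DOM nodes (assumption).
type FigureNode =
  | { kind: "text"; text: string }
  | { kind: "element"; tagName: string; children: FigureNode[] };
type FigureElement = Extract<FigureNode, { kind: "element" }>;

// Hypothetical tag lists: image-like leaves, and wrappers that may hold only images.
const IMAGE_TAGS: string[] = ["img", "source"];
const IMAGE_WRAPPER_TAGS: string[] = ["a", "picture"];

// True for whitespace text, images, and wrappers that contain nothing but images.
function isImageOnly(node: FigureNode): boolean {
  if (node.kind === "text") return node.text.trim() === "";
  if (IMAGE_TAGS.indexOf(node.tagName) !== -1) return true;
  if (IMAGE_WRAPPER_TAGS.indexOf(node.tagName) !== -1) {
    return node.children.every((child) => isImageOnly(child));
  }
  return false;
}

// Cheap structural check: is there anything besides the image and the caption?
function hasResidualFigureContent(figure: FigureElement): boolean {
  return !figure.children.every(
    (child) =>
      (child.kind === "element" && child.tagName === "figcaption") ||
      isImageOnly(child),
  );
}

// Fast path for well-formed figures; the expensive pass runs only when needed.
function convertFigure(
  figure: FigureElement,
  imageMarkdown: string,
  convertResidual: (figure: FigureElement) => string,
): string {
  if (!hasResidualFigureContent(figure)) return imageMarkdown;
  return imageMarkdown + convertResidual(figure);
}

const elem = (tagName: string, ...children: FigureNode[]): FigureElement => ({
  kind: "element",
  tagName,
  children,
});
const txt = (text: string): FigureNode => ({ kind: "text", text });

const tidyFigure = elem("figure", elem("a", elem("img")), txt("\n"), elem("figcaption", txt("Fig. 1")));
const brokenFigure = elem("figure", elem("img"), elem("p", txt("Later article text")));

console.log(hasResidualFigureContent(tidyFigure)); // false
console.log(hasResidualFigureContent(brokenFigure)); // true
```

The check walks only the figure's own subtree and allocates nothing, so it costs far less than the clone, serialize, and convert sequence it guards.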
Posted a follow-up commit to address the Copilot review comments.
Changes in that commit:
Verification:
Summary
This fixes a regression where Defuddle clipped the Lilian Weng "Why We Think" article partway through extraction.
The root cause was a malformed or unclosed <figure> in the source HTML. Figure normalization treated the node as a standard image figure, which caused later article content nested inside that figure to be collapsed during markdown conversion.
Changes
- Skip figure standardization when a <figure> contains non-image content that suggests malformed HTML
- Regression fixture: https://lilianweng.github.io/posts/2025-05-01-thinking/
Testing
npm test -- tests/lilian-thinking.test.ts tests/fixtures.test.ts