Fix clipping when malformed <figure> nests article content by shuuul · Pull Request #151 · kepano/defuddle

shuuul · 2026-03-09T12:41:58Z

Summary

This fixes a regression where Defuddle clipped the Lilian Weng "Why We Think" article partway through extraction.

The root cause was a malformed or unclosed <figure> in the source HTML. Figure normalization treated the node as a standard image figure, which caused later article content nested inside that figure to be collapsed during markdown conversion.

Changes

skip figure standardization when a <figure> contains non-image content that suggests malformed HTML
preserve remaining non-image content inside a figure after extracting the image and caption to markdown
add a regression fixture for https://lilianweng.github.io/posts/2025-05-01-thinking/
add snapshot coverage for the full extracted markdown
add a focused regression test asserting the article is extracted through the citation and references sections

Testing

npm test -- tests/lilian-thinking.test.ts tests/fixtures.test.ts

Copilot

Pull request overview

Fixes a regression where extraction/markdown conversion was clipped when malformed/unclosed <figure> tags caused subsequent article content to be nested inside a figure.

Changes:

Skip figure standardization when a <figure> contains apparent non-image block content (malformed HTML signal).
Update markdown figure conversion to preserve any remaining non-image content after emitting the image + caption.
Add Lilian Weng “Why We Think” regression fixture + expected markdown snapshot, plus a focused regression test.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`src/markdown.ts`	Extends the Turndown `<figure>` rule to emit image+caption and then convert/preserve any remaining nested content.
`src/elements/images.ts`	Adds a malformed-figure detector (`hasNonImageContent`) to bypass standardization when a figure contains apparent non-image content.
`tests/lilian-thinking.test.ts`	Adds a focused regression test ensuring extraction reaches Citation/References and meets a minimum word count.
`tests/fixtures/general--lilianweng.github.io:posts:2025-05-01-thinking:.html`	Adds the regression HTML fixture exhibiting malformed/unclosed `<figure>` nesting behavior.
`tests/expected/general--lilianweng.github.io:posts:2025-05-01-thinking:.md`	Adds the expected extracted markdown snapshot for the new fixture.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-09T12:47:44Z

src/markdown.ts

+			const cloneImg = clone.querySelector('img');
+			if (cloneImg) {
+				const parentLink = cloneImg.closest('a');
+				if (parentLink && parentLink !== clone) {
+					parentLink.remove();
+				} else {
+					cloneImg.remove();
+				}


In the clone cleanup, removing the closest parent (parentLink.remove()) will drop any non-image content inside that link (e.g. text nodes or other elements) and defeats the goal of preserving non-image content from malformed figures. Consider removing just the (or unwrapping the while preserving its other children) instead of removing the entire anchor element.

Addressed in 4219bb2. The cleanup now removes only the cloned <img> instead of dropping the entire parent link, so linked text/content inside malformed figures is preserved. I also added a focused regression test in tests/markdown-figures.test.ts covering this case.

Copilot · 2026-03-09T12:47:44Z

src/markdown.ts

+			let imageMarkdown = `![${alt}](${src})\n\n${caption}\n\n`;
+
+			// Preserve non-image content inside the figure (e.g. when an unclosed
+			// <figure> tag causes subsequent content to be nested inside it)
+			const clone = node.cloneNode(true) as Element;


This rule now always clones the entire
, serializes it, and runs an extra Turndown pass to look for remaining content. For the common case of well-formed figures (only image+caption), this adds avoidable work during markdown conversion. Consider short-circuiting unless the figure appears malformed / contains extra non-image nodes (e.g. a quick structural check before cloning), or only running the extra conversion when leftover content is actually possible.

Addressed in 4219bb2. I added a hasResidualFigureContent guard so well-formed image+caption figures return early and skip the extra clone/serialization/Turndown pass. The malformed-figure regression still passes.

shuuul · 2026-03-09T13:57:43Z

Posted a follow-up commit to address the Copilot review comments: 4219bb2.

Changes in that commit:

preserve linked text/content when stripping the cloned figure image
short-circuit the extra figure cleanup path unless residual non-image content is present
add a focused regression test for linked content inside a malformed figure

Verification:

npm test -- tests/markdown-figures.test.ts tests/lilian-thinking.test.ts tests/fixtures.test.ts
58 tests passed

Fix clipping for malformed figure content

59db6d4

shuuul marked this pull request as ready for review March 9, 2026 12:43

Copilot AI review requested due to automatic review settings March 9, 2026 12:43

Copilot started reviewing on behalf of shuuul March 9, 2026 12:43 View session

Copilot AI reviewed Mar 9, 2026

View reviewed changes

Address figure markdown review feedback

4219bb2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix clipping when malformed <figure> nests article content#151

Fix clipping when malformed <figure> nests article content#151
shuuul wants to merge 2 commits intokepano:mainfrom
shuuul:codex/fix-lilian-thinking-clipping

shuuul commented Mar 9, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 9, 2026

Uh oh!

shuuul Mar 9, 2026

Uh oh!

Copilot AI Mar 9, 2026

Uh oh!

shuuul Mar 9, 2026

Uh oh!

shuuul commented Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

shuuul commented Mar 9, 2026

Summary

Changes

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

shuuul Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

shuuul Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

shuuul commented Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants