Skip to content

Fix clipping when malformed <figure> nests article content#151

Open
shuuul wants to merge 2 commits intokepano:mainfrom
shuuul:codex/fix-lilian-thinking-clipping
Open

Fix clipping when malformed <figure> nests article content#151
shuuul wants to merge 2 commits intokepano:mainfrom
shuuul:codex/fix-lilian-thinking-clipping

Conversation

@shuuul
Copy link

@shuuul shuuul commented Mar 9, 2026

Summary

This fixes a regression where Defuddle clipped the Lilian Weng "Why We Think" article partway through extraction.

The root cause was a malformed or unclosed <figure> in the source HTML. Figure normalization treated the node as a standard image figure, which caused later article content nested inside that figure to be collapsed during markdown conversion.

Changes

  • skip figure standardization when a <figure> contains non-image content that suggests malformed HTML
  • preserve remaining non-image content inside a figure after extracting the image and caption to markdown
  • add a regression fixture for https://lilianweng.github.io/posts/2025-05-01-thinking/
  • add snapshot coverage for the full extracted markdown
  • add a focused regression test asserting the article is extracted through the citation and references sections

Testing

  • npm test -- tests/lilian-thinking.test.ts tests/fixtures.test.ts

@shuuul shuuul marked this pull request as ready for review March 9, 2026 12:43
Copilot AI review requested due to automatic review settings March 9, 2026 12:43
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a regression where extraction/markdown conversion was clipped when malformed/unclosed <figure> tags caused subsequent article content to be nested inside a figure.

Changes:

  • Skip figure standardization when a <figure> contains apparent non-image block content (malformed HTML signal).
  • Update markdown figure conversion to preserve any remaining non-image content after emitting the image + caption.
  • Add Lilian Weng “Why We Think” regression fixture + expected markdown snapshot, plus a focused regression test.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/markdown.ts Extends the Turndown <figure> rule to emit image+caption and then convert/preserve any remaining nested content.
src/elements/images.ts Adds a malformed-figure detector (hasNonImageContent) to bypass standardization when a figure contains apparent non-image content.
tests/lilian-thinking.test.ts Adds a focused regression test ensuring extraction reaches Citation/References and meets a minimum word count.
tests/fixtures/general--lilianweng.github.io:posts:2025-05-01-thinking:.html Adds the regression HTML fixture exhibiting malformed/unclosed <figure> nesting behavior.
tests/expected/general--lilianweng.github.io:posts:2025-05-01-thinking:.md Adds the expected extracted markdown snapshot for the new fixture.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/markdown.ts Outdated
Comment on lines +279 to +286
const cloneImg = clone.querySelector('img');
if (cloneImg) {
const parentLink = cloneImg.closest('a');
if (parentLink && parentLink !== clone) {
parentLink.remove();
} else {
cloneImg.remove();
}
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the clone cleanup, removing the closest parent (parentLink.remove()) will drop any non-image content inside that link (e.g. text nodes or other elements) and defeats the goal of preserving non-image content from malformed figures. Consider removing just the (or unwrapping the while preserving its other children) instead of removing the entire anchor element.

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 4219bb2. The cleanup now removes only the cloned <img> instead of dropping the entire parent link, so linked text/content inside malformed figures is preserved. I also added a focused regression test in tests/markdown-figures.test.ts covering this case.

src/markdown.ts Outdated
Comment on lines +273 to +277
let imageMarkdown = `![${alt}](${src})\n\n${caption}\n\n`;

// Preserve non-image content inside the figure (e.g. when an unclosed
// <figure> tag causes subsequent content to be nested inside it)
const clone = node.cloneNode(true) as Element;
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rule now always clones the entire

, serializes it, and runs an extra Turndown pass to look for remaining content. For the common case of well-formed figures (only image+caption), this adds avoidable work during markdown conversion. Consider short-circuiting unless the figure appears malformed / contains extra non-image nodes (e.g. a quick structural check before cloning), or only running the extra conversion when leftover content is actually possible.

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 4219bb2. I added a hasResidualFigureContent guard so well-formed image+caption figures return early and skip the extra clone/serialization/Turndown pass. The malformed-figure regression still passes.

@shuuul
Copy link
Author

shuuul commented Mar 9, 2026

Posted a follow-up commit to address the Copilot review comments: 4219bb2.

Changes in that commit:

  • preserve linked text/content when stripping the cloned figure image
  • short-circuit the extra figure cleanup path unless residual non-image content is present
  • add a focused regression test for linked content inside a malformed figure

Verification:

  • npm test -- tests/markdown-figures.test.ts tests/lilian-thinking.test.ts tests/fixtures.test.ts
  • 58 tests passed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants