Skip to content

Fix --include-comments parsing for legacy S1 LJ themes#11

Merged
fantomc0der merged 4 commits into
mainfrom
fix-comments-legacy-themes
Apr 27, 2026
Merged

Fix --include-comments parsing for legacy S1 LJ themes#11
fantomc0der merged 4 commits into
mainfrom
fix-comments-legacy-themes

Conversation

@fantomc0der
Copy link
Copy Markdown
Owner

Motivation

--include-comments was producing markdown like this for older journals:

**Anonymous** — [Link](https://example.livejournal.com/123.html?thread=456#t456)

**()**

The comment count was right, but the user, timestamp, and body were all missing. Two underlying problems were combining to produce the bug:

  1. The b-tree-twig parser only matches modern LJ themes. Older S1 layouts wrap each comment in a <div id="ljcmt{thread}"> container — sometimes with a comment_bar_one header div + a sibling body div + a footer div, and sometimes with a single <table id="cmtbar{thread}"> containing all three sections — and neither shape was handled. The catch-all "thread links" fallback grabbed an arbitrary ?thread= anchor (typically the bare (Link) permalink), walked up to its container, and used that as the whole comment.
  2. On modern themes the default ?view=comments URL serves an empty <div class="b-tree b-tree-root"> and loads the comment tree via JS, so the modern parser also returned zero comments unless the journal happened to be a permanent S1 layout.

Approach

  • Fetch the static comment view. buildCommentUrl now appends ?nojs=1&view=comments, which is the same fallback URL LJ's <noscript> meta-refresh points at. This forces modern themes to render the full server-side tree and keeps S1 themes working unchanged.
  • Add a dedicated S1 parser. extractCommentsFromHtml now branches: modern (.b-tree-twig) → existing parser; otherwise look for [id^="ljcmt"] containers and run a new parseLegacyComment that pulls each field from a stable, theme-independent source — username from span.ljuser[data-ljuser], profile URL from a.i-ljuser-username, timestamp from the visible text of the title-bearing <span> in the comment header, permalink from the ?thread={id} anchor that matches this comment's id, depth from the margin-left: Npx style on the container (25 px per level on S1 themes), and the body via two layout-specific strategies.
  • Body extraction handles both S1 layouts. For comment_bar_one themes the body is a sibling div of the header; the parser strips .comment_bar_one/.comment_bar_alt plus footer divs whose only content is reply/parent/thread links. For cmtbar{id} themes the entire comment lives in one table; the parser walks only the outer rows, skips header rows (which contain the metadata sub-table) and footer rows (Reply/Parent/Thread links), and replaces the table with the body row's content.
  • Tests use HTML fixtures derived from real S1 markup (with placeholder usernames per the repo's hard rule), covering both S1 variants, the anonymous-comment case, depth extraction from margin-left, and the existing modern-theme expectations. The previous loose extractLegacyComments fallback is replaced — the explicit S1 parser is strictly better in every case it used to handle.

Verification

  • bun test — 124 pass, 0 fail.
  • bun run typecheck — clean.
  • End-to-end manual checks against the three example URLs in the bug report (deleted the day file in archive/ and re-ran archive --start-date YYYY-MM-DD --days 1 --include-comments):
    • S1 comment_bar_one (single comment + dual-comment day): each entry now produces the expected **[user](profile)** — [date](permalink) line followed by the actual comment body.
    • Modern b-tree-twig (8-comment thread, 4 levels deep): every comment is captured with correct >-prefixed nesting.
    • S1 cmtbar{id} (4-comment chain, depth 0→3): all four comments render with correct user, date, body, and depth.

Test plan

  • Unit tests cover modern theme + both S1 layouts + anonymous + depth math.
  • CLI run for an S1 comment_bar_one day produces user/date/body for every comment.
  • CLI run for a modern theme day with deep nesting produces correctly indented threads.
  • CLI run for a cmtbar{id} layout produces user/date/body for every comment with correct depth.

🤖 Generated with Claude Code

Comment pages on older themes use ljcmt{id} containers (with either comment_bar_one headers or cmtbar{id} table wrappers) instead of b-tree-twig. The previous fallback grabbed the wrong anchor, producing Anonymous plus a bare "Link" timestamp with no body. Modern themes were also affected because the default ?view=comments URL renders an empty b-tree-root and loads comments via JS.

- Switch the comment URL to ?nojs=1&view=comments so modern themes serve the static, server-rendered comment tree.
- Add a dedicated S1 parser for ljcmt{id} divs that handles both comment_bar_one and cmtbar{id} layouts, with depth from margin-left, username from data-ljuser, profile URL from the i-ljuser-username anchor, timestamp from the title-bearing span, and per-layout body extraction.
- Update fixtures and tests to cover both S1 variants plus an anonymous-comment case.
@fantomc0der fantomc0der marked this pull request as draft April 27, 2026 22:14
@fantomc0der
Copy link
Copy Markdown
Owner Author

Auto-converted PR to Draft. Mark as Ready for Review when the PR is finished to trigger Claude review and the auto-merge gate.

@github-actions
Copy link
Copy Markdown
Contributor

TypeScript Review

✅ No TypeScript errors were found in these changes

@fantomc0der fantomc0der marked this pull request as ready for review April 27, 2026 22:14
@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

Claude finished @fantomc0der's task in 1m 58s —— View job


Tasks

  • Gather context (PR description, changed files)
  • Read src/scrapers/comments.ts (primary change)
  • Read tests/scrapers/comments.test.ts (test coverage)
  • Check CLAUDE.md hard rules compliance
  • Verify previous review concerns are addressed
  • Post review verdict

Review Summary

This PR replaces a catch-all fallback (extractLegacyComments) with a precise, two-branch dispatcher — modern b-tree-twig and S1 legacy [id^="ljcmt"] — and fixes the core bug (empty comment bodies on S1 themes) with a dedicated parseLegacyComment that pulls each field from stable, theme-independent DOM nodes.

All three concerns from the previous review round are resolved

rowHasReplyOrParentLink Link arm (comments.ts:318-320) — the implementation now gates the Link/Parent/Thread text match behind [?&]thread= in the href, exactly as the suggestion proposed. A new test ("does not treat user-authored <a>Link</a> in body as a footer row") locks in the correct behavior for a user-authored <a href="https://example.com">Link</a> anchor.

divIsLegacyFooter replaceAll (comments.ts:299)replaceAll is used, not replace. Not an issue.

findLegacyTimestamp scoping (comments.ts:192-198) — the function scopes to .comment_bar_one, .comment_bar_alt for the first S1 layout and to the first <tr> of the cmtbar sub-table for the second. The body-shadowing test ("ignores year-bearing spans in the body when extracting timestamp") demonstrates this works correctly.

Code quality observations

a[name^="t"] removal is slightly over-broad (comments.ts:219). The selector removes named anchors whose name starts with "t". LJ inserts <a name="t{number}"> as a jump target, but a user-authored anchor like <a name="top"> or <a name="terms"> in the comment body would also be removed. In practice, named anchors in LJ comment bodies are vanishingly rare, and the strip is purely a cleanup step (the anchor itself carries no visible content), so this is low-impact. A tighter selector such as a[name^="t"]:not([href]) or a regex guard on the name value would be strictly safer, but it is not worth blocking the PR over.

findLegacyPermalink regex uses threadNumeric directly in a RegExp constructor (comments.ts:172). threadNumeric is the suffix of an id="ljcmt..." attribute after stripping the prefix, so for any real LJ page it will be purely numeric and safe. If a hand-crafted HTML page injected regex metacharacters, the constructor would throw. An escapeRegExp guard would be defensive, but the input is server-generated HTML from a trusted source, so this is acceptable as-is.

$table.find("> tbody > tr, …, > tr") selector (comments.ts:231). The combined selector covers both the case where htmlparser2 inserts an implicit <tbody> and the case where the raw HTML has <tr> as a direct child. The test fixtures exercise this path successfully.

CLAUDE.md hard rules

Rule Status
No real usernames ✓ Fixtures use commenter1commenter11
No hard wrapping ✓ All prose is soft-wrapped
No @ts-ignore / as any ✓ Clean TypeScript throughout
Docs in sync ✓ Both AGENTS.md and README.md updated

Test coverage

The test file adds three buildCommentUrl tests and nine extractCommentsFromHtml tests covering: both S1 layouts, anonymous comments in both layouts, depth from margin-left, timestamp scoping with a year-bearing body span, thread-ID substring safety, and the user-authored <a>Link</a> body row. This is thorough coverage for the new code paths.

REVIEW: PASS

Comment thread src/scrapers/comments.ts
const href = $a.attr("href") ?? "";
if (/[?&]replyto=|[?&]mode=reply/i.test(href)) return true;
const text = ($a.text() ?? "").trim();
return /^(?:Reply|Parent|Thread|Link)$/i.test(text);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matching "Link" as a footer indicator is safe in the cmtbar layout because the Link anchor lives inside the header sub-table and is excluded by the sub-table guard on line 226 before this function is ever called. But the comment explaining that guard references the sub-table check — a brief inline note here would make the reason explicit so a future reader doesn't strip the "Link" arm thinking it's overly broad.

Comment thread src/scrapers/comments.ts Outdated
let nonLinkText = ($d.text() ?? "").trim();
$links.toArray().forEach((a) => {
const linkText = ($(a as Element).text() ?? "").trim();
if (linkText) nonLinkText = nonLinkText.replace(linkText, "");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String.replace(linkText, "") only removes the first occurrence of linkText. If the same word appears multiple times as link text in a footer div, the second occurrence stays in nonLinkText, the length check on line 290 sees a non-zero remainder, and the div is not identified as a footer — meaning reply links stay in the body. In practice LJ footer divs have distinct link labels (Reply, Thread, Parent), so this shouldn't trigger, but replaceAll would be more defensive here.

Comment thread src/scrapers/comments.ts Outdated
// Visible text contains a 4-digit year; the title contains a relative phrase
// like "11 minutes after journal entry".
let timestampText = "";
$cmt.find("span[title]").each((_i, sp) => {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

findLegacyTimestamp scans the entire $cmt subtree for any span[title] whose visible text contains a four-digit year. It stops at the first match, which is normally the metadata header span. However, if a comment body contains user-authored text like <span>Happened in 2003</span>, that span would also match — and if it appears before the header span in DOM order, the timestamp would be wrong. The risk is low (LJ's header span appears first in both S1 layouts), but scoping the search to the header region (.comment_bar_one or the first <tr> of the cmtbar sub-table) would be more robust.

- Update README.md Architecture section so it agrees with AGENTS.md and CLAUDE.md (hard rule: docs in sync).
- Scope findLegacyTimestamp to the comment header region (comment_bar_one or the cmtbar header row) so a year-bearing span in the body cannot shadow the real timestamp.
- Use replaceAll in divIsLegacyFooter so a footer with the same link text twice is still recognised.
- Add an inline note in rowHasReplyOrParentLink explaining that the "Link" arm is safe because the cmtbar header sub-table is filtered by the caller before this check runs.
- Add tests for depth>0 in the comment_bar_one variant and for the year-in-body shadowing case.
- Replace href.includes("thread=N") substring check in extractLegacyBody with the same word-boundary regex used in findLegacyPermalink, so a body link to ?thread=500 isn't stripped when the surrounding comment id is t5.
- Move the comment_bar_one anonymous-comment test out of the cmtbar describe block (where it had been mis-filed) into the comment_bar_one block where its fixture belongs.
- Add a true cmtbar-layout anonymous-comment test so both S1 variants now have parity coverage.
- Add a regression test for the substring-bug fix.
Comment thread src/scrapers/comments.ts
const href = $a.attr("href") ?? "";
if (/[?&]replyto=|[?&]mode=reply/i.test(href)) return true;
const text = ($a.text() ?? "").trim();
return /^(?:Reply|Parent|Thread|Link)$/i.test(text);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Link arm here is the one worth reconsidering. The inline comment on lines 309–311 explains why the LJ-generated (Link) permalink can't appear in a non-header row — it's inside the metadata sub-table, which is already excluded by the nested-table guard above. So this arm is never needed to catch LJ-generated links.

What it does catch is any user-authored body anchor whose visible text happens to be the single word "Link" (case-insensitive). For example, a commenter who wrote:

<a href="https://example.com">Link</a> for more context

would cause rowHasReplyOrParentLink to return true, the row would be skipped, and the comment body would be silently empty.

The fix is to scope the Link arm to anchors that also have thread= in their href (which is how LJ always generates the permalink):

Suggested change
return /^(?:Reply|Parent|Thread|Link)$/i.test(text);
if (/[?&]replyto=|[?&]mode=reply/i.test(href)) return true;
if (/[?&]thread=/.test(href) && /^Link$/i.test(($a.text() ?? "").trim())) return true;
const text = ($a.text() ?? "").trim();
return /^(?:Reply|Parent|Thread)$/i.test(text);

No existing test covers a cmtbar body row with a user-authored <a>Link</a> — adding one would lock in the correct behavior.

- Gate the Parent/Thread/Link text-match arm in rowHasReplyOrParentLink on a thread= href so a user-authored anchor with text "Link" pointing at an unrelated URL can't be misclassified as an LJ footer row and silently strip the comment body.
- Add a regression test exercising that scenario in the cmtbar layout.
- Note the body-shadow risk on the last-resort fallback in findLegacyTimestamp so future readers understand the trade-off.
@fantomc0der fantomc0der merged commit 13ca873 into main Apr 27, 2026
4 checks passed
@fantomc0der fantomc0der deleted the fix-comments-legacy-themes branch April 27, 2026 22:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant