Fix --include-comments parsing for legacy S1 LJ themes #11
Comment pages on older themes use ljcmt{id} containers (with either comment_bar_one headers or cmtbar{id} table wrappers) instead of b-tree-twig. The previous fallback grabbed the wrong anchor, producing Anonymous plus a bare "Link" timestamp with no body. Modern themes were also affected because the default ?view=comments URL renders an empty b-tree-root and loads comments via JS.
- Switch the comment URL to ?nojs=1&view=comments so modern themes serve the static, server-rendered comment tree.
- Add a dedicated S1 parser for ljcmt{id} divs that handles both comment_bar_one and cmtbar{id} layouts, with depth from margin-left, username from data-ljuser, profile URL from the i-ljuser-username anchor, timestamp from the title-bearing span, and per-layout body extraction.
- Update fixtures and tests to cover both S1 variants plus an anonymous-comment case.
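The URL switch and the depth rule described in the list above can be sketched as two small pure helpers. This is a minimal sketch, not the PR's code: the function names and signatures are hypothetical, and the 25 px per nesting level is the figure the PR description gives for S1 themes.

```typescript
// Hypothetical helpers for illustration; not taken from the PR diff.

/** Build the no-JS comment URL for an entry page (assumed shape of the fix). */
function buildCommentUrl(entryUrl: string): string {
  const url = new URL(entryUrl);
  // nojs=1 asks LJ for the static, server-rendered comment tree.
  url.searchParams.set("nojs", "1");
  url.searchParams.set("view", "comments");
  return url.toString();
}

/** Derive nesting depth from an inline style such as "margin-left: 50px". */
function depthFromStyle(style: string | undefined, pxPerLevel = 25): number {
  const match = /margin-left:\s*(\d+)px/i.exec(style ?? "");
  if (!match) return 0; // no indent style means a top-level comment
  return Math.round(Number(match[1]) / pxPerLevel);
}

console.log(buildCommentUrl("https://example.livejournal.com/1234.html"));
console.log(depthFromStyle("margin-left: 50px"));
```

Keeping both helpers free of DOM access makes them easy to unit-test separately from the HTML fixtures.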
Auto-converted PR to Draft. Mark as Ready for Review when the PR is finished to trigger Claude review and the auto-merge gate.
TypeScript Review: ✅ No TypeScript errors were found in these changes
Claude finished @fantomc0der's task in 1m 58s
Review Summary
This PR replaces a catch-all fallback (
All three concerns from the previous review round are resolved
Code quality observations
CLAUDE.md hard rules
Test coverage
The test file adds three
REVIEW: PASS
const href = $a.attr("href") ?? "";
if (/[?&]replyto=|[?&]mode=reply/i.test(href)) return true;
const text = ($a.text() ?? "").trim();
return /^(?:Reply|Parent|Thread|Link)$/i.test(text);
Matching "Link" as a footer indicator is safe in the cmtbar layout because the Link anchor lives inside the header sub-table and is excluded by the sub-table guard on line 226 before this function is ever called. But the comment explaining that guard references the sub-table check — a brief inline note here would make the reason explicit so a future reader doesn't strip the "Link" arm thinking it's overly broad.
let nonLinkText = ($d.text() ?? "").trim();
$links.toArray().forEach((a) => {
  const linkText = ($(a as Element).text() ?? "").trim();
  if (linkText) nonLinkText = nonLinkText.replace(linkText, "");
String.replace(linkText, "") only removes the first occurrence of linkText. If the same word appears multiple times as link text in a footer div, the second occurrence stays in nonLinkText, the length check on line 290 sees a non-zero remainder, and the div is not identified as a footer — meaning reply links stay in the body. In practice LJ footer divs have distinct link labels (Reply, Thread, Parent), so this shouldn't trigger, but replaceAll would be more defensive here.
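The difference between first-occurrence and remove-all replacement can be seen in isolation. The sketch below is illustrative only: the helper names and the sample footer text are assumptions, and `removeAll` uses `split`/`join`, which behaves like `replaceAll` for a non-empty string pattern while also running on pre-ES2021 targets.

```typescript
// Illustrative only: compares first-occurrence removal with remove-all.

function removeFirst(text: string, label: string): string {
  // With a string pattern, String.prototype.replace touches only the first match.
  return text.replace(label, "");
}

function removeAll(text: string, label: string): string {
  // Same result as text.replaceAll(label, "") for a non-empty label.
  return text.split(label).join("");
}

const footer = "(Reply) (Reply)"; // hypothetical footer text with a repeated label

console.log(JSON.stringify(removeFirst(footer, "Reply"))); // "() (Reply)"
console.log(JSON.stringify(removeAll(footer, "Reply"))); // "() ()"
```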
// Visible text contains a 4-digit year; the title contains a relative phrase
// like "11 minutes after journal entry".
let timestampText = "";
$cmt.find("span[title]").each((_i, sp) => {
findLegacyTimestamp scans the entire $cmt subtree for any span[title] whose visible text contains a four-digit year. It stops at the first match, which is normally the metadata header span. However, if a comment body contains a user-authored span with a title attribute, such as <span title="note">Happened in 2003</span>, that span would also match, and if it appears before the header span in DOM order, the timestamp would be wrong. The risk is low (LJ's header span appears first in both S1 layouts), but scoping the search to the header region (.comment_bar_one or the first <tr> of the cmtbar sub-table) would be more robust.
- Update README.md Architecture section so it agrees with AGENTS.md and CLAUDE.md (hard rule: docs in sync).
- Scope findLegacyTimestamp to the comment header region (comment_bar_one or the cmtbar header row) so a year-bearing span in the body cannot shadow the real timestamp.
- Use replaceAll in divIsLegacyFooter so a footer with the same link text twice is still recognised.
- Add an inline note in rowHasReplyOrParentLink explaining that the "Link" arm is safe because the cmtbar header sub-table is filtered by the caller before this check runs.
- Add tests for depth>0 in the comment_bar_one variant and for the year-in-body shadowing case.
- Replace href.includes("thread=N") substring check in extractLegacyBody with the same word-boundary regex used in findLegacyPermalink, so a body link to ?thread=500 isn't stripped when the surrounding comment id is t5.
- Move the comment_bar_one anonymous-comment test out of the cmtbar describe block (where it had been mis-filed) into the comment_bar_one block where its fixture belongs.
- Add a true cmtbar-layout anonymous-comment test so both S1 variants now have parity coverage.
- Add a regression test for the substring-bug fix.
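The substring bug fixed in the list above is easy to reproduce with plain strings. The following is a sketch under stated assumptions: `hrefTargetsThread` is a hypothetical name, the thread id is assumed to be numeric (so it needs no regex escaping), and the exact regex used by findLegacyPermalink is not shown in this PR page, so a `\b` word boundary stands in for it.

```typescript
// Hypothetical helper; the real regex in findLegacyPermalink may differ.
function hrefTargetsThread(href: string, threadId: string): boolean {
  // \b after the id stops "thread=5" from matching inside "thread=500".
  const pattern = new RegExp(`[?&]thread=${threadId}\\b`);
  return pattern.test(href);
}

const bodyLink = "https://example.livejournal.com/1.html?thread=500#t500";
const permalink = "https://example.livejournal.com/1.html?thread=5#t5";

// The old substring check wrongly treats the body link as comment 5's permalink.
console.log(bodyLink.includes("thread=5")); // true
console.log(hrefTargetsThread(bodyLink, "5")); // false
console.log(hrefTargetsThread(permalink, "5")); // true
```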
const href = $a.attr("href") ?? "";
if (/[?&]replyto=|[?&]mode=reply/i.test(href)) return true;
const text = ($a.text() ?? "").trim();
return /^(?:Reply|Parent|Thread|Link)$/i.test(text);
The Link arm here is the one worth reconsidering. The inline comment on lines 309–311 explains why the LJ-generated (Link) permalink can't appear in a non-header row — it's inside the metadata sub-table, which is already excluded by the nested-table guard above. So this arm is never needed to catch LJ-generated links.
What it does catch is any user-authored body anchor whose visible text happens to be the single word "Link" (case-insensitive). For example, a commenter who wrote:
<a href="https://example.com">Link</a> for more context
would cause rowHasReplyOrParentLink to return true, the row would be skipped, and the comment body would be silently empty.
The fix is to scope the Link arm to anchors that also have thread= in their href (which is how LJ always generates the permalink):
return /^(?:Reply|Parent|Thread|Link)$/i.test(text);
if (/[?&]replyto=|[?&]mode=reply/i.test(href)) return true;
if (/[?&]thread=/.test(href) && /^Link$/i.test(($a.text() ?? "").trim())) return true;
const text = ($a.text() ?? "").trim();
return /^(?:Reply|Parent|Thread)$/i.test(text);
No existing test covers a cmtbar body row with a user-authored <a>Link</a> — adding one would lock in the correct behavior.
- Gate the Parent/Thread/Link text-match arm in rowHasReplyOrParentLink on a thread= href so a user-authored anchor with text "Link" pointing at an unrelated URL can't be misclassified as an LJ footer row and silently strip the comment body.
- Add a regression test exercising that scenario in the cmtbar layout.
- Note the body-shadow risk on the last-resort fallback in findLegacyTimestamp so future readers understand the trade-off.
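The gating described in this commit can be sketched as a pure predicate over an anchor's href and text. This is a hedged illustration rather than the merged code: `AnchorInfo` and `looksLikeFooterLink` are hypothetical names, and the sketch assumes "Reply" links are recognised through their href alone.

```typescript
// Hypothetical shape for the two anchor properties the check needs.
interface AnchorInfo {
  href: string;
  text: string;
}

function looksLikeFooterLink({ href, text }: AnchorInfo): boolean {
  // Reply links are identified by their query string (assumption: href only).
  if (/[?&]replyto=|[?&]mode=reply/i.test(href)) return true;
  // Parent/Thread/Link labels count only when the href carries thread=,
  // which is how the review says LJ generates these links.
  const label = text.trim();
  return /[?&]thread=/.test(href) && /^(?:Parent|Thread|Link)$/i.test(label);
}

// A user-authored "Link" to an unrelated site is no longer treated as footer.
console.log(looksLikeFooterLink({ href: "https://example.com", text: "Link" })); // false
console.log(
  looksLikeFooterLink({
    href: "https://example.livejournal.com/1.html?thread=5#t5",
    text: "Link",
  }),
); // true
```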
Motivation
--include-comments was producing broken markdown for older journals: each comment rendered as Anonymous with a bare "Link" timestamp and no body. The comment count was right, but the user, timestamp, and body were all missing. Two underlying problems combined to produce the bug:
- Legacy S1 themes wrap each comment in a <div id="ljcmt{thread}"> container, sometimes with a comment_bar_one header div plus a sibling body div and a footer div, and sometimes with a single <table id="cmtbar{thread}"> containing all three sections, and neither shape was handled. The catch-all "thread links" fallback grabbed an arbitrary ?thread= anchor (typically the bare (Link) permalink), walked up to its container, and used that as the whole comment.
- On modern themes, the default ?view=comments URL serves an empty <div class="b-tree b-tree-root"> and loads the comment tree via JS, so the modern parser also returned zero comments unless the journal happened to be a permanent S1 layout.

Approach
- buildCommentUrl now appends ?nojs=1&view=comments, which is the same fallback URL LJ's <noscript> meta-refresh points at. This forces modern themes to render the full server-side tree and keeps S1 themes working unchanged.
- extractCommentsFromHtml now branches: modern (.b-tree-twig) → existing parser; otherwise look for [id^="ljcmt"] containers and run a new parseLegacyComment that pulls each field from a stable, theme-independent source: username from span.ljuser[data-ljuser], profile URL from a.i-ljuser-username, timestamp from the visible text of the title-bearing <span> in the comment header, permalink from the ?thread={id} anchor that matches this comment's id, depth from the margin-left: Npx style on the container (25 px per level on S1 themes), and the body via two layout-specific strategies.
- For comment_bar_one themes the body is a sibling div of the header; the parser strips .comment_bar_one/.comment_bar_alt plus footer divs whose only content is reply/parent/thread links. For cmtbar{id} themes the entire comment lives in one table; the parser walks only the outer rows, skips header rows (which contain the metadata sub-table) and footer rows (Reply/Parent/Thread links), and replaces the table with the body row's content.
- Fixtures and tests cover both S1 variants, depth from margin-left, and the existing modern-theme expectations. The previous loose extractLegacyComments fallback is replaced; the explicit S1 parser is strictly better in every case it used to handle.

Verification
- bun test: 124 pass, 0 fail.
- bun run typecheck: clean.
- (… archive/ and re-ran archive --start-date YYYY-MM-DD --days 1 --include-comments):
  - comment_bar_one (single comment + dual-comment day): each entry now produces the expected **[user](profile)** — [date](permalink) line followed by the actual comment body.
  - b-tree-twig (8-comment thread, 4 levels deep): every comment is captured with correct >-prefixed nesting.
  - cmtbar{id} (4-comment chain, depth 0→3): all four comments render with correct user, date, body, and depth.

Test plan
- comment_bar_one day produces user/date/body for every comment.
- cmtbar{id} layout produces user/date/body for every comment with correct depth.

🤖 Generated with Claude Code