fix(zhihu/download): author extraction, zhida links, redirect detection #54

Open
yrom wants to merge 2 commits into nashsu:main from yrom:fix/zhihu-download

yrom commented May 22, 2026

Summary

Fix multiple issues in the zhihu download adapter.

Fixes #40 — Author extraction returns "unknown"

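Root cause: querySelector with the combined selector .AuthorInfo-name, .UserLink-link returns the first element in document order that matches either selector, regardless of the order in which the selectors are written. When a .UserLink-link element with empty text precedes .AuthorInfo-name in the DOM, that empty element is returned and the real author name is never read.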

Fix: Split into separate querySelector calls with proper fallback order:

  1. .AuthorInfo-name (primary)
  2. .UserLink-link (secondary)
  3. meta[itemprop="author"] (stable meta tag fallback)
  4. meta[name="author"]
  5. js-initialData JSON parsing (last resort)
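
The fallback order above can be sketched as follows. This is a minimal illustration, not the adapter's actual code: the helper name extractAuthor and the ElementLike/QueryRoot types are hypothetical, the meta tags are assumed to carry the name in their content attribute, and the js-initialData step is left out because its JSON layout is not described here.

```typescript
// Minimal structural types so the sketch also runs outside a browser.
// In a page, `document` would be passed as the root (assumption).
interface ElementLike {
  textContent: string | null;
  getAttribute(name: string): string | null;
}

interface QueryRoot {
  querySelector(selector: string): ElementLike | null;
}

// Hypothetical helper: query each source separately, in priority order,
// and return the first value that is non-empty after trimming.
function extractAuthor(root: QueryRoot): string {
  const candidates: Array<() => string | null | undefined> = [
    () => root.querySelector(".AuthorInfo-name")?.textContent,
    () => root.querySelector(".UserLink-link")?.textContent,
    // Assumption: the meta tags expose the author in `content`.
    () => root.querySelector('meta[itemprop="author"]')?.getAttribute("content"),
    () => root.querySelector('meta[name="author"]')?.getAttribute("content"),
  ];
  for (const read of candidates) {
    const value = (read() ?? "").trim();
    if (value) return value; // an empty match falls through to the next source
  }
  return "unknown";
}
```

Because each selector is queried on its own, an empty .UserLink-link element can no longer shadow a populated .AuthorInfo-name, wherever the two sit in the DOM.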

Fixes #42 — Template variable not filled + wrong article

Root cause 1: On failure (content element not found), the evaluate step returned an array [{...}] instead of a plain object {...}. The download step's ${{ data.title }} template couldn't resolve, leaving the literal string in the output.

Fix: Return a plain object with all required fields (filename, imageUrls, content, output) on error, matching the success case structure.
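
A small sketch of why the shape matters. The renderTemplate function below is a hypothetical stand-in for the download step's template resolution, not autocli's real engine, and errorResult is an illustrative helper. The title field is an assumption based on the ${{ data.title }} reference; the other fields are the ones listed above.

```typescript
// Assumed result shape shared by the success and error paths.
interface DownloadData {
  title: string;
  filename: string;
  imageUrls: string[];
  content: string;
  output: string;
}

// Hypothetical helper: an error result with the same fields as a success.
function errorResult(message: string): DownloadData {
  return { title: message, filename: "", imageUrls: [], content: "", output: message };
}

// Hypothetical template resolver: replaces ${{ data.key }} with data[key]
// and leaves the placeholder untouched when the key cannot be resolved.
function renderTemplate(template: string, data: unknown): string {
  return template.replace(/\$\{\{\s*data\.(\w+)\s*\}\}/g, (match: string, key: string) => {
    if (data === null || typeof data !== "object") return match;
    const value = (data as Record<string, unknown>)[key];
    return value === undefined ? match : String(value);
  });
}

// An array has no `title` property, so the placeholder survives as a literal.
const fromArray = renderTemplate("${{ data.title }}", [errorResult("Article not found")]);
// A plain object resolves as expected.
const fromObject = renderTemplate("${{ data.title }}", errorResult("Article not found"));
```

Under these assumptions, fromArray keeps the literal placeholder while fromObject becomes the error message, which mirrors the behavior described in root cause 1.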

Root cause 2: When an article URL is removed/not found, Zhihu redirects to the homepage. The adapter would then scrape the homepage content instead of reporting an error.

Fix: Detect redirect by checking location.hostname and location.pathname after navigation. Return a clear "Article not found" error.
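
One way such a check could look, as a sketch only. The exact comparison the adapter performs is not shown in this description; the version below assumes that any change of hostname or pathname relative to the requested URL counts as a redirect, and the names LocationLike, normalizePath and checkNavigation are hypothetical.

```typescript
// Subset of `window.location` needed for the check.
interface LocationLike {
  hostname: string;
  pathname: string;
}

// Treat "/p/123" and "/p/123/" as the same path; an empty path becomes "/".
function normalizePath(pathname: string): string {
  const trimmed = pathname.replace(/\/+$/, "");
  return trimmed === "" ? "/" : trimmed;
}

// Hypothetical check: returns an error message when the page we ended up on
// is not the one that was requested, or null when navigation looks correct.
function checkNavigation(requested: LocationLike, actual: LocationLike): string | null {
  const sameHost = requested.hostname === actual.hostname;
  const samePath = normalizePath(requested.pathname) === normalizePath(actual.pathname);
  return sameHost && samePath ? null : "Article not found";
}
```

In a page context, the second argument would be `location` read after navigation has settled. A stricter or looser rule (for example, flagging only a landing on "/") may suit the site better; this sketch does not settle that.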

Additional improvements

  • Strip zhida.zhihu.com links: These are Zhihu's internal AI search links that add noise to the markdown output. Now only the display text is kept.
  • Increase settleMs from 3000 to 5000 for more reliable page loading.
  • Add path column to output table so users can see the downloaded file path.
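
The zhida link stripping can be illustrated with a short sketch. It assumes the links have already been converted to Markdown syntax and are removed with a regular expression; the adapter may instead handle them at the DOM level, and the function name stripZhidaLinks is hypothetical.

```typescript
// Hypothetical helper: replace [text](https://zhida.zhihu.com/...) with just
// the display text, leaving all other Markdown links untouched.
function stripZhidaLinks(markdown: string): string {
  const zhidaLink = /\[([^\]]*)\]\(https?:\/\/zhida\.zhihu\.com(?:[\/?#][^)\s]*)?\)/g;
  return markdown.replace(zhidaLink, "$1");
}

const cleaned = stripZhidaLinks(
  "use [NGINX](https://zhida.zhihu.com/search?q=NGINX) with [docs](https://nginx.org/en/docs/)"
);
// cleaned === "use NGINX with [docs](https://nginx.org/en/docs/)"
```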

Test plan

  • autocli zhihu download "https://www.zhihu.com/question/351504112/answer/2027391723035275294" — author correctly shows "NGINX洪志道" (was "unknown")
  • autocli zhihu download "https://zhuanlan.zhihu.com/p/60954299" — shows "Article not found" error (was ${{ data.title }} literal)
  • autocli zhihu download "https://zhuanlan.zhihu.com/p/61154299" — correct article fetched
  • autocli zhihu download "https://www.zhihu.com/question/2040832649963508039/answer/2041160812719498582" — new URL format works

Development

Successfully merging this pull request may close these issues.

zhihu download: template variable bug and wrong article fetched
zhihu download: author extraction returns "unknown" for answer pages