fix(zhihu/download): author extraction, zhida links, redirect detection #54
Open
yrom wants to merge 2 commits into
Summary
Fix multiple issues in the `zhihu download` adapter.

Fixes #40: Author extraction returns "unknown"
Root cause: The combined CSS selector `.AuthorInfo-name, .UserLink-link` matches elements in DOM order. A `.UserLink-link` element appearing before `.AuthorInfo-name` in the DOM returns empty text, causing the fallback chain to skip the real author name.
Fix: Split into separate `querySelector` calls with proper fallback order:
1. `.AuthorInfo-name` (primary)
2. `.UserLink-link` (secondary)
3. `meta[itemprop="author"]` (stable meta tag fallback)
4. `meta[name="author"]`
5. `js-initialData` JSON parsing (last resort)

Fixes #42: Template variable not filled + wrong article
Root cause 1: On failure (content element not found), the evaluate step returned an array `[{...}]` instead of a plain object `{...}`. The download step's `${{ data.title }}` template couldn't resolve, leaving the literal string in the output.
Fix: Return a plain object with all required fields (`filename`, `imageUrls`, `content`, `output`) on error, matching the success case structure.
Root cause 2: When an article URL is removed or not found, Zhihu redirects to the homepage. The adapter would then scrape the homepage content instead of reporting an error.
Fix: Detect the redirect by checking `location.hostname` and `location.pathname` after navigation. Return a clear "Article not found" error.

Additional improvements
- `zhida.zhihu.com` links: These are Zhihu's internal AI search links that add noise to the markdown output. Now only the display text is kept.
- Increased `settleMs` from 3000 to 5000 for more reliable page loading.
- Added a `path` column to the output table so users can see the downloaded file path.

Test plan
- `autocli zhihu download "https://www.zhihu.com/question/351504112/answer/2027391723035275294"`: author correctly shows "NGINX洪志道" (was "unknown")
- `autocli zhihu download "https://zhuanlan.zhihu.com/p/60954299"`: shows "Article not found" error (was `${{ data.title }}` literal)
- `autocli zhihu download "https://zhuanlan.zhihu.com/p/61154299"`: correct article fetched
- `autocli zhihu download "https://www.zhihu.com/question/2040832649963508039/answer/2041160812719498582"`: new URL format works
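
Illustrative sketch

For reviewers who want the logic of the fixes in one place, here is a rough JavaScript sketch. It is not the adapter's actual code. The function names (`firstNonEmpty`, `isHomepageRedirect`, `stripZhidaLinks`, `notFoundResult`), the homepage test (`www.zhihu.com` with path `/`), the Markdown link regex, and the placeholder field values are all assumptions made for illustration. Only the fallback order, the use of `location.hostname` and `location.pathname`, and the field names `title`, `filename`, `imageUrls`, `content`, `output` come from the description above.

```javascript
// Author fallback: return the first non-empty text, in priority order.
// `lookups` is a hypothetical list of zero-argument functions, one per source.
// In the page these might look like (assumed, not taken from the adapter):
//   () => document.querySelector(".AuthorInfo-name")?.textContent
//   () => document.querySelector(".UserLink-link")?.textContent
//   () => document.querySelector('meta[itemprop="author"]')?.content
function firstNonEmpty(lookups) {
  for (const lookup of lookups) {
    const text = (lookup() || "").trim();
    if (text) return text; // an empty match no longer hides later sources
  }
  return "unknown";
}

// Redirect detection: a removed article ends up on the Zhihu homepage.
// Treating "www.zhihu.com" with path "/" as the homepage is an assumption.
function isHomepageRedirect(loc) {
  return loc.hostname === "www.zhihu.com" &&
    (loc.pathname === "/" || loc.pathname === "");
}

// zhida links: replace Markdown links to zhida.zhihu.com with their display text.
function stripZhidaLinks(markdown) {
  const zhidaLink = /\[([^\]]*)\]\(https?:\/\/zhida\.zhihu\.com(?:\/[^)]*)?\)/g;
  return markdown.replace(zhidaLink, "$1");
}

// Error result: a plain object (not an array) with the same fields as the
// success case, so a template such as ${{ data.title }} can still resolve.
// The values here are placeholders.
function notFoundResult() {
  return {
    title: "Article not found",
    filename: "",
    imageUrls: [],
    content: "",
    output: "",
  };
}
```

Keeping each check as a small pure function makes the fallback order and the redirect rule easy to unit test outside the browser; the real adapter may well inline them in the evaluate step instead.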