feat: add generic web read command for any URL → Markdown #343
Merged
jackwener merged 2 commits into jackwener:main on Mar 24, 2026
Adds a new `opencli web read --url <any-url>` command that fetches any web page and exports it as clean Markdown with optional image download.

Uses browser-side DOM heuristics for content extraction, tried in order:

1. `<article>` element
2. `[role="main"]` element
3. `<main>` element
4. Largest text-dense block fallback

Pipes through the existing article-download pipeline (Turndown + image localization), so it inherits code block handling, frontmatter generation, and concurrent image downloading for free.

Tested on: Anthropic blog, OpenAI blog, general news sites.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
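The selector cascade described in this commit could be sketched roughly as follows. This is an illustration only, not the code in `read.ts`: the names `ElementLike`, `pickContentRoot`, and `paragraphTextLength` are hypothetical, the "skip empty matches" rule is an assumption, and the fallback scoring (total `<p>` text, ties going to the deeper element) is just one plausible reading of "largest text-dense block".

```typescript
// Minimal structural type so the sketch also runs outside a browser.
// A real DOM Element should satisfy it (assumption, not verified against read.ts).
interface ElementLike {
  textContent: string | null;
  querySelector(selector: string): ElementLike | null;
  querySelectorAll(selector: string): ArrayLike<ElementLike>;
}

// Selectors tried in priority order, as listed in the commit message.
const PREFERRED_SELECTORS = ["article", '[role="main"]', "main"];

function textLength(el: ElementLike): number {
  return (el.textContent ?? "").trim().length;
}

// Hypothetical density proxy: total length of text inside <p> descendants.
function paragraphTextLength(el: ElementLike): number {
  const paragraphs = el.querySelectorAll("p");
  let total = 0;
  for (let i = 0; i < paragraphs.length; i++) {
    total += textLength(paragraphs[i]);
  }
  return total;
}

function pickContentRoot(root: ElementLike): ElementLike {
  // Steps 1-3: the first preferred selector with non-empty text wins.
  for (const selector of PREFERRED_SELECTORS) {
    const match = root.querySelector(selector);
    if (match && textLength(match) > 0) return match;
  }
  // Step 4: fall back to the block holding the most paragraph text.
  // Candidates arrive in document order, so ">=" lets a descendant that
  // holds the same paragraphs as its wrapper replace the wrapper.
  let best = root;
  let bestScore = 0;
  const candidates = root.querySelectorAll("div, section");
  for (let i = 0; i < candidates.length; i++) {
    const score = paragraphTextLength(candidates[i]);
    if (score > 0 && score >= bestScore) {
      best = candidates[i];
      bestScore = score;
    }
  }
  return best;
}
```

In a browser context this would be called as `pickContentRoot(document.body)`; if nothing scores above zero, the sketch simply returns the root it was given.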
Anthropic's blog renders each paragraph twice (a normal version + a line-broken animation version). The previous substring-based dedup missed these because whitespace differences changed string lengths.

Fix: compare texts after stripping ALL whitespace, and keep the version with more proper spacing (more spaces = better formatted).

Result on Anthropic blog: 98.4KB → 53.7KB (45% reduction).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
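A minimal sketch of the whitespace-insensitive dedup described in this commit, assuming the page text has already been split into an array of paragraph strings. The function names are hypothetical, and dropping whitespace-only paragraphs is an added assumption; the actual code in the PR may be organized differently.

```typescript
// Counts literal space characters, used as the "better formatted" signal.
function countSpaces(text: string): number {
  return (text.match(/ /g) ?? []).length;
}

// Removes paragraphs that are identical once ALL whitespace is stripped,
// keeping whichever duplicate contains more spaces.
function dedupeParagraphs(paragraphs: string[]): string[] {
  const indexByKey = new Map<string, number>(); // stripped text -> index in `kept`
  const kept: string[] = [];
  for (const text of paragraphs) {
    const key = text.replace(/\s+/g, "");
    if (key === "") continue; // assumption: whitespace-only paragraphs are dropped
    const seenAt = indexByKey.get(key);
    if (seenAt === undefined) {
      indexByKey.set(key, kept.length);
      kept.push(text);
    } else if (countSpaces(text) > countSpaces(kept[seenAt])) {
      // Same content, better spacing: replace in place to preserve order.
      kept[seenAt] = text;
    }
  }
  return kept;
}
```

Keying on the stripped text makes the comparison independent of string length, which is the property the earlier substring-based approach lacked according to the commit message.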
Owner
Reviewed, fixed, and merged. I aligned the primary URL argument with the repo positional-arg convention, added a command-level test, and updated README/docs so the new adapter is discoverable. Re-ran …
Summary
Adds `opencli web read --url <any-url>`, a generic command to fetch any web page and export it as clean Markdown with optional local image download.

Problem: OpenCLI has excellent adapters for specific sites (weixin, zhihu, reddit, etc.), but no way to handle arbitrary URLs. Users who want to save articles from sites without a dedicated adapter (Anthropic blog, OpenAI blog, personal blogs, news sites) have no built-in option.
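Solution: A new `web/read.ts` adapter that:

- Extracts the main content with DOM heuristics (`<article>` → `[role="main"]` → `<main>` → largest text block fallback)
- Pipes it through the existing `article-download.ts` pipeline (Turndown + image download)

Usage: `opencli web read --url <any-url>`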
Options:

| Option | Default |
| --- | --- |
| `--url` | |
| `--output` | `./web-articles` |
| `--download-images` | `true` |
| `--wait` | `3` |

Test Results
Implementation
Single file: `src/clis/web/read.ts` (179 lines). Zero new dependencies: it reuses the existing `article-download.ts` pipeline and Turndown.

🤖 Generated with Claude Code