feat: add generic web read command for any URL → Markdown #343

Merged
jackwener merged 2 commits into jackwener:main from echo-hhj:feat/web-read
Mar 24, 2026

Conversation

@echo-hhj
Contributor

Summary

Adds opencli web read --url <any-url> — a generic command to fetch any web page and export it as clean Markdown with optional local image download.

Problem: OpenCLI has excellent adapters for specific sites (weixin, zhihu, reddit, etc.), but no way to handle arbitrary URLs. Users who want to save articles from sites without a dedicated adapter (Anthropic blog, OpenAI blog, personal blogs, news sites) have no built-in option.

Solution: A new web/read.ts adapter that:

  • Uses browser-side DOM heuristics to extract main content (<article> → [role="main"] → <main> → largest text block fallback)
  • Cleans noise elements (nav, footer, sidebar, ads, comments) before extraction
  • Extracts metadata (title from og:title/title/h1, author from meta tags, publish time)
  • Pipes through the existing article-download.ts pipeline (Turndown + image download)
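The selector-priority fallback described above can be sketched as a pure function. This is an illustrative sketch, not the actual read.ts code; `pickMainContent` and `CandidateBlock` are hypothetical names:

```typescript
// Sketch of the content-extraction heuristic: honor semantic containers
// in priority order, then fall back to the block with the most text.
interface CandidateBlock {
  selector: string; // CSS selector the block matched
  text: string;     // its visible text content
}

const SELECTOR_PRIORITY = ["article", '[role="main"]', "main"];

function pickMainContent(blocks: CandidateBlock[]): string {
  // Steps 1-3: explicit semantic containers win, in priority order
  for (const sel of SELECTOR_PRIORITY) {
    const hit = blocks.find(
      (b) => b.selector === sel && b.text.trim().length > 0,
    );
    if (hit) return hit.text;
  }
  // Step 4: no semantic container found, keep the largest text block
  let best = "";
  for (const b of blocks) {
    if (b.text.length > best.length) best = b.text;
  }
  return best;
}
```

In the real adapter this logic runs browser-side after the noise elements (nav, footer, sidebar, ads) have already been removed, so the "largest block" fallback is less likely to grab chrome.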

Usage:

```
opencli web read --url "https://www.anthropic.com/research/..." --output ./articles
opencli web read --url "https://openai.com/index/..." --download-images false
opencli web read --url "https://any-website.com/post" --wait 5
```

Options:

| Option | Default | Description |
| --- | --- | --- |
| `--url` | (required) | Any web page URL |
| `--output` | `./web-articles` | Output directory |
| `--download-images` | `true` | Download images locally |
| `--wait` | `3` | Seconds to wait after page load |

Test Results

| Site | Result |
| --- | --- |
| Anthropic blog | 98.4 KB markdown + images |
| OpenAI blog | 16.8 KB clean markdown |
| General news sites | Works with standard article layouts |

Implementation

Single file: src/clis/web/read.ts (179 lines). Zero new dependencies — reuses existing article-download.ts pipeline and Turndown.

🤖 Generated with Claude Code

Harrison and others added 2 commits March 24, 2026 13:27
Adds a new `opencli web read --url <any-url>` command that fetches any
web page and exports it as clean Markdown with optional image download.

Uses browser-side DOM heuristics for content extraction:
  1. <article> element
  2. [role="main"] element
  3. <main> element
  4. Largest text-dense block fallback

Pipes through the existing article-download pipeline (Turndown + image
localization), so it inherits code block handling, frontmatter generation,
and concurrent image downloading for free.

Tested on: Anthropic blog, OpenAI blog, general news sites.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Anthropic's blog renders each paragraph twice (a normal version + a
line-broken animation version). The previous substring-based dedup
missed these because whitespace differences changed string lengths.

Fix: compare texts after stripping ALL whitespace, and keep the
version with more proper spacing (more spaces = better formatted).

Result on Anthropic blog: 98.4KB → 53.7KB (45% reduction).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
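The whitespace-insensitive dedup described in that commit can be sketched as follows. This is a minimal sketch under the stated approach, not the actual patch; `dedupePair` and `stripAllWhitespace` are illustrative names:

```typescript
// Compare paragraphs with ALL whitespace stripped, so the line-broken
// animation copy and the normal copy hash to the same text.
function stripAllWhitespace(s: string): string {
  return s.replace(/\s+/g, "");
}

// Returns the better-formatted of two paragraphs that are duplicates
// once whitespace is ignored, or null if they are not duplicates.
function dedupePair(a: string, b: string): string | null {
  if (stripAllWhitespace(a) !== stripAllWhitespace(b)) return null;
  // Keep the version with more spaces: more spacing = better formatting
  const spaceCount = (s: string) => (s.match(/ /g) ?? []).length;
  return spaceCount(a) >= spaceCount(b) ? a : b;
}
```

The earlier substring-based check failed exactly because the animation copy inserts line breaks, so the two copies had different lengths and neither was a substring of the other.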
@jackwener jackwener merged commit f6466db into jackwener:main Mar 24, 2026
@jackwener
Copy link
Owner

Reviewed, fixed, and merged. I aligned the primary URL argument with the repo's positional-argument convention, added a command-level test, and updated the README/docs so the new adapter is discoverable. Locally re-ran npm run typecheck, a targeted Vitest run for web/read, npm test, npm run build, and npm run docs:build.
