feat: add generic web read command for any URL → Markdown #343

Merged
jackwener merged 2 commits into jackwener:main from echo-hhj:feat/web-read
Mar 24, 2026

Conversation

@echo-hhj
Contributor

Summary

Adds opencli web read --url <any-url> — a generic command to fetch any web page and export it as clean Markdown with optional local image download.

Problem: OpenCLI has excellent adapters for specific sites (weixin, zhihu, reddit, etc.), but no way to handle arbitrary URLs. Users who want to save articles from sites without a dedicated adapter (Anthropic blog, OpenAI blog, personal blogs, news sites) have no built-in option.

Solution: A new web/read.ts adapter that:

  • Uses browser-side DOM heuristics to extract main content (<article> → [role="main"] → <main> → largest text block fallback)
  • Cleans noise elements (nav, footer, sidebar, ads, comments) before extraction
  • Extracts metadata (title from og:title/title/h1, author from meta tags, publish time)
  • Pipes through the existing article-download.ts pipeline (Turndown + image download)
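The selector-priority fallback described above can be sketched as a pure function. This is an illustrative sketch, not the actual read.ts code; `pickMainContent` and `CandidateBlock` are hypothetical names:

```typescript
// Sketch of the content-extraction heuristic: honor semantic containers
// in priority order, then fall back to the block with the most text.
interface CandidateBlock {
  selector: string; // CSS selector the block matched
  text: string;     // its visible text content
}

const SELECTOR_PRIORITY = ["article", '[role="main"]', "main"];

function pickMainContent(blocks: CandidateBlock[]): string {
  // Steps 1-3: explicit semantic containers win, in priority order
  for (const sel of SELECTOR_PRIORITY) {
    const hit = blocks.find(
      (b) => b.selector === sel && b.text.trim().length > 0,
    );
    if (hit) return hit.text;
  }
  // Step 4: no semantic container found, keep the largest text block
  let best = "";
  for (const b of blocks) {
    if (b.text.length > best.length) best = b.text;
  }
  return best;
}
```

In the real adapter this logic runs browser-side after the noise elements (nav, footer, sidebar, ads) have already been removed, so the "largest block" fallback is less likely to grab chrome.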

Usage:

```
opencli web read --url "https://www.anthropic.com/research/..." --output ./articles
opencli web read --url "https://openai.com/index/..." --download-images false
opencli web read --url "https://any-website.com/post" --wait 5
```

Options:

| Option | Default | Description |
| --- | --- | --- |
| `--url` | (required) | Any web page URL |
| `--output` | `./web-articles` | Output directory |
| `--download-images` | `true` | Download images locally |
| `--wait` | `3` | Seconds to wait after page load |

Test Results

| Site | Result |
| --- | --- |
| Anthropic blog | 98.4 KB markdown + images |
| OpenAI blog | 16.8 KB clean markdown |
| General news sites | Works with standard article layouts |

Implementation

Single file: src/clis/web/read.ts (179 lines). Zero new dependencies — reuses existing article-download.ts pipeline and Turndown.

🤖 Generated with Claude Code

Harrison and others added 2 commits March 24, 2026 13:27
Adds a new `opencli web read --url <any-url>` command that fetches any
web page and exports it as clean Markdown with optional image download.

Uses browser-side DOM heuristics for content extraction:
  1. <article> element
  2. [role="main"] element
  3. <main> element
  4. Largest text-dense block fallback

Pipes through the existing article-download pipeline (Turndown + image
localization), so it inherits code block handling, frontmatter generation,
and concurrent image downloading for free.

Tested on: Anthropic blog, OpenAI blog, general news sites.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Anthropic's blog renders each paragraph twice (a normal version + a
line-broken animation version). The previous substring-based dedup
missed these because whitespace differences changed string lengths.

Fix: compare texts after stripping ALL whitespace, and keep the
version with more proper spacing (more spaces = better formatted).

Result on Anthropic blog: 98.4KB → 53.7KB (45% reduction).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
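The whitespace-insensitive dedup described in that commit can be sketched as follows. This is a minimal sketch under the stated approach, not the actual patch; `dedupePair` and `stripAllWhitespace` are illustrative names:

```typescript
// Compare paragraphs with ALL whitespace stripped, so the line-broken
// animation copy and the normal copy hash to the same text.
function stripAllWhitespace(s: string): string {
  return s.replace(/\s+/g, "");
}

// Returns the better-formatted of two paragraphs that are duplicates
// once whitespace is ignored, or null if they are not duplicates.
function dedupePair(a: string, b: string): string | null {
  if (stripAllWhitespace(a) !== stripAllWhitespace(b)) return null;
  // Keep the version with more spaces: more spacing = better formatting
  const spaceCount = (s: string) => (s.match(/ /g) ?? []).length;
  return spaceCount(a) >= spaceCount(b) ? a : b;
}
```

The earlier substring-based check failed exactly because the animation copy inserts line breaks, so the two copies had different lengths and neither was a substring of the other.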
@jackwener jackwener merged commit f6466db into jackwener:main Mar 24, 2026
@jackwener
Copy link
Owner

Reviewed, fixed, and merged. I aligned the primary URL argument with the repo's positional-argument convention, added a command-level test, and updated the README/docs so the new adapter is discoverable. Locally re-ran npm run typecheck, a targeted Vitest run for web/read, npm test, npm run build, and npm run docs:build.
