feat: Site Recipe Engine (v2.2.0) #22
Merged
syswave-dev merged 26 commits into main (May 6, 2026)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…move Thread resolved recipe through convertWithReadability: recipe.preprocess actions are applied after generic preprocess(), and recipe.removeSelectors are appended to cleanDom's REMOVE_SELECTORS for both call sites. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pick the UA once per request in extractWeb and pass it to renderClient so static fetch and Playwright render share the same identity. renderViaSidecar forwards it as `userAgent` in the POST body (omitted when undefined for backwards compatibility). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
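The "omitted when undefined" behaviour described in this commit can be sketched as follows. This is only an illustration: `buildRenderBody` is a hypothetical helper name, and the real `renderViaSidecar` may assemble its POST body differently.

```javascript
// Hypothetical helper (not from the PR): builds the JSON body for the
// sidecar render request. `userAgent` is only included when the caller
// supplied one, so a body without it contains `url` alone.
function buildRenderBody(url, userAgent) {
  const body = { url };
  if (userAgent !== undefined) {
    body.userAgent = userAgent;
  }
  return body;
}

// Without a UA the body carries only the URL.
console.log(JSON.stringify(buildRenderBody('https://example.com/')));
// With a UA picked once per request, the same string is forwarded.
console.log(JSON.stringify(buildRenderBody('https://example.com/', 'ExampleUA/1.0')));
```

Leaving the key out entirely, rather than sending `null`, keeps the request identical to the pre-change shape for callers that never pick a UA.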
…coded UA) Add optional `userAgent` field to RenderRequest and `_render`. Desktop branch uses `user_agent or USER_AGENT`; mobile branch is untouched — the iPhone device profile always wins there. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gments) The recipe pattern was '/*/issues/*' which only matches '/foo/issues/N', not GitHub's actual URL structure '/org/repo/issues/N' (3 segments). The recipe never applied to real GitHub URLs, so issue conversions silently fell through to the readability path with comments left JS-rendered and absent from the markdown. Reproduced against test instance with github.com//issues/10: cache row had source=readability, indicating recipe wasn't matched. Added a matcher test pinning the real-world path shape.
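The mismatch described in this commit is easy to reproduce with a segment-wise glob in which `*` stands for exactly one path segment. The sketch below assumes that semantics; `matchPathGlob` is a hypothetical name and not necessarily how the matcher in `lib/recipes.js` is written.

```javascript
// Illustrative segment-wise path glob (an assumption, not the PR's code):
// `*` matches exactly one path segment, every other segment must be equal.
function matchPathGlob(pattern, pathname) {
  const patternSegs = pattern.split('/').filter(Boolean);
  const pathSegs = pathname.split('/').filter(Boolean);
  // A pattern with a different number of segments can never match.
  if (patternSegs.length !== pathSegs.length) return false;
  return patternSegs.every((seg, i) => seg === '*' || seg === pathSegs[i]);
}

console.log(matchPathGlob('/*/issues/*', '/foo/issues/10'));        // true
console.log(matchPathGlob('/*/issues/*', '/org/repo/issues/10'));   // false
console.log(matchPathGlob('/*/*/issues/*', '/org/repo/issues/10')); // true
```

Under this reading the old pattern silently skips every real issue URL, which is why a test pinning the `/org/repo/issues/N` shape is a useful guard.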
9db0053 to 04494e0
Summary
Implements the declarative `site-recipes.json` engine described in #18: per-host preprocess, fetch, select, and extractor rules, applied during the extraction pipeline in `lib/web.js`. Default recipes ship in the repo; self-hosters can mount `data/site-recipes.json` or set `PULLMD_SITE_RECIPES` to supply their own.

Coordinated cross-component changes:
- `lib/recipes.js`: Zod schema, file loader (default + user overlay, recipe-level rejection), host/path glob matcher, list+scalar merge logic, preprocess action engine, content-hash boot logic
- `extractWeb()`: match recipes, route fetch options, thread preprocess actions, forward Playwright sidecar fields
- …`meta` table; on change, `recipes_invalidated_at` bumps and stale rows lazy-refresh
- `waitFor`, `waitTimeoutMs`, `mobileUa`, `userAgent`: backwards compatible (old fields still work)
- `playwright-stealth` for headless-detection mitigation
- `GET /api/recipes/status` for monitoring (UptimeKuma-friendly)

Default recipes shipped
- `future-plc-paywall-aria` + `future-plc-recommendations`: strip the `aria-hidden` paywall pattern and the recommendation-widget clusters that Readability over-clusters into. Verified against article fixtures.
- `github-issues` (path: `/*/*/issues/*`): forces Playwright with `wait_for: .js-comment-body` so JS-rendered comments are captured. Verified end-to-end against a real GitHub issues page on staging.

Important: `:latest` stays on v1.x

The Docker `:latest` tag remains pinned to v1.x until the scheduled flip on 2026-05-16. Self-hosters wanting recipes must pin v2.2.0 explicitly for both images.

`.github/workflows/docker.yml` already gates `:latest` to v1.x tags only (verified, no CI change needed).

Known limitations (documented in CHANGELOG)
- …`fetch.cookies` recipe field so operators can paste their own consent state when they choose to.

Test plan
- `node --test`
- …`aria-hidden` + paywall class strip works against the deployed test instance
- …`wait_for`
- `/api/recipes/status` returns `{ok:true, loaded:3, ...}` on the test instance
- …`url`); new PullMD + old sidecar (extra fields silently ignored)
- …`/A/B/issues/N`

Thanks
…`WinFuture23/real-world-user-agents` feed already wired into `lib/user-agent.js`. The recipe engine is the next-step generalization of that work.

Closes #18.
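For reviewers, here is a minimal sketch of the content-hash invalidation idea mentioned in the summary (a hash check at boot, `recipes_invalidated_at` bumped on change, stale rows refreshed lazily). Everything beyond those names is an assumption: the `Map` stands in for the `meta` table, and `recipes_hash`, `syncRecipesHash`, `isStale`, and `cachedAt` are hypothetical names, not the PR's actual code.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical boot step: hash the recipes file content and compare it with
// the hash stored in `meta` (here a Map standing in for the real table).
// Returns true when the recipes changed and the invalidation time was bumped.
function syncRecipesHash(meta, recipesJson, now = Date.now()) {
  const hash = createHash('sha256').update(recipesJson).digest('hex');
  if (meta.get('recipes_hash') === hash) return false; // unchanged, keep cache
  meta.set('recipes_hash', hash);
  meta.set('recipes_invalidated_at', now);
  return true;
}

// Lazy refresh: a cached row is stale only if it was stored before the last
// invalidation, so nothing is rewritten until the row is requested again.
function isStale(row, meta) {
  return row.cachedAt < (meta.get('recipes_invalidated_at') ?? 0);
}

const meta = new Map();
console.log(syncRecipesHash(meta, '{"recipes":[]}', 1000)); // true (first boot)
console.log(syncRecipesHash(meta, '{"recipes":[]}', 2000)); // false (same content)
console.log(isStale({ cachedAt: 500 }, meta));              // true
console.log(isStale({ cachedAt: 1500 }, meta));             // false
```

Note that this sketch treats the very first boot as a change; whether the real loader does the same is not stated in the PR.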