
feat: Site Recipe Engine (v2.2.0)#22

Merged
syswave-dev merged 26 commits into main from feat/recipe-engine-v2.2 on May 6, 2026


@syswave-dev syswave-dev commented May 6, 2026

Summary

Implements the declarative site-recipes.json engine described in #18. The engine applies per-host preprocess, fetch, select, and extractor rules during the extraction pipeline in lib/web.js. Default recipes ship in the repo; self-hosters can mount data/site-recipes.json or set PULLMD_SITE_RECIPES to supply their own.

Coordinated cross-component changes:

  • New module lib/recipes.js — Zod schema, file loader (default + user overlay, recipe-level rejection), host/path glob matcher, list+scalar merge logic, preprocess action engine, content-hash boot logic
  • Four hook points integrated into extractWeb() — match recipes, route fetch options, thread preprocess actions, forward Playwright sidecar fields
  • Cache invalidation: a SHA-256 hash of the recipe content is compared at boot against the value stored in a new meta table; on change, recipes_invalidated_at is bumped and stale rows refresh lazily
  • Playwright sidecar accepts new optional fields waitFor, waitTimeoutMs, mobileUa, userAgent — backwards compatible (old fields still work)
  • Sidecar bundles playwright-stealth for headless-detection mitigation
  • Sidecar honors PullMD's UA-rotation pool (was hardcoded before)
  • Public GET /api/recipes/status for monitoring (UptimeKuma-friendly)
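The cache-invalidation bullet above can be sketched roughly as follows. This is a minimal illustration, not the code in lib/recipes.js: the helper names (`hashRecipes`, `checkRecipesAtBoot`, `isStale`), the `recipes_hash` key, the `cachedAt` row field, and the use of a `Map` in place of the meta table are all assumptions. Only the SHA-256 comparison at boot and the `recipes_invalidated_at` bump come from the description.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical helper: hash the raw recipe file contents.
function hashRecipes(rawJson) {
  return createHash("sha256").update(rawJson, "utf8").digest("hex");
}

// `meta` stands in for the meta table; a Map is used purely for illustration.
// Returns whether the recipe content changed since the previous boot.
function checkRecipesAtBoot(meta, rawJson, now = Date.now()) {
  const current = hashRecipes(rawJson);
  if (meta.get("recipes_hash") === current) {
    return { changed: false };
  }
  // Content changed (or first boot): store the new hash and bump the marker.
  meta.set("recipes_hash", current);
  meta.set("recipes_invalidated_at", now);
  return { changed: true };
}

// Assumed staleness rule: a cached row is stale if it was written
// before the most recent invalidation, so it refreshes on next access.
function isStale(row, meta) {
  const invalidatedAt = meta.get("recipes_invalidated_at") ?? 0;
  return row.cachedAt < invalidatedAt;
}
```

In this sketch nothing is purged eagerly: rows are only compared against the timestamp when they are next read, which is one way to get the lazy refresh the bullet describes.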

Default recipes shipped

  • future-plc-paywall-aria + future-plc-recommendations — strip the aria-hidden paywall pattern and recommendation-widget clusters that Readability over-clusters into. Verified against article fixtures.
  • github-issues (path: /*/*/issues/*) — forces Playwright with wait_for: .js-comment-body so JS-rendered comments are captured. Verified end-to-end against a real GitHub issues page on staging.

Important — :latest stays on v1.x

The Docker :latest tag remains pinned to v1.x until the scheduled flip on 2026-05-16. Self-hosters wanting recipes must pin v2.2.0 explicitly for both images:

services:
  pullmd:
    image: aeternalabshq/pullmd:2.2.0
  playwright:
    image: aeternalabshq/pullmd-playwright:2.2.0

.github/workflows/docker.yml already gates :latest to v1.x tags only — verified, no CI change needed.

Known limitations (documented in CHANGELOG)

  • Sites behind cookie-based consent walls (third-party CMP frameworks like TCF v2) are not unlocked by recipes in this release. Such sites only return article content once HttpOnly cookies are set after a user click on the consent UI — out of scope for the recipe engine itself. A future release will add a fetch.cookies recipe field so operators can paste their own consent state when they choose to.

Test plan

  • All 471 tests pass via node --test
  • End-to-end smoke: recipe-driven aria-hidden + paywall class strip works against the deployed test instance
  • End-to-end smoke: a 3-segment GitHub-issues URL returns body + comments via Playwright with wait_for
  • /api/recipes/status returns {ok:true, loaded:3, ...} on the test instance
  • Cross-version compat: old PullMD + new sidecar (sidecar reads only url); new PullMD + old sidecar (extra fields silently ignored)
  • Path glob regression test pinned for /A/B/issues/N
  • Manual verification post-merge against Future-PLC URL on production-staging
  • Tag v2.2.0 — both images build via the existing workflow (pullmd + playwright-sidecar live in this same repo)

Thanks

Closes #18.

syswave-dev and others added 26 commits May 6, 2026 10:37
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…move

Thread resolved recipe through convertWithReadability: recipe.preprocess
actions are applied after generic preprocess(), and recipe.removeSelectors
are appended to cleanDom's REMOVE_SELECTORS for both call sites.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
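A rough sketch of how recipe selectors might be appended to the default removal list, as the commit message above describes. The contents of `REMOVE_SELECTORS` here are placeholders, `selectorsFor` is a hypothetical name, and the de-duplication step is an assumption; only the idea of appending `recipe.removeSelectors` to cleanDom's `REMOVE_SELECTORS` comes from the commit.

```javascript
// Placeholder for cleanDom's built-in list; the real contents are not shown in this PR.
const REMOVE_SELECTORS = ["script", "style", "nav"];

// Hypothetical helper: build the effective selector list for one request.
function selectorsFor(recipe) {
  const extra = recipe?.removeSelectors ?? [];
  // Build a new array so the shared default list is never mutated;
  // the Set drops duplicates while preserving first-seen order (an assumption).
  return [...new Set([...REMOVE_SELECTORS, ...extra])];
}
```

Returning a fresh array per call keeps one site's recipe from leaking selectors into later requests for other hosts.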
Pick the UA once per request in extractWeb and pass it to renderClient
so static fetch and Playwright render share the same identity. renderViaSidecar
forwards it as `userAgent` in the POST body (omitted when undefined for
backwards compatibility).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
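The backwards-compatible POST body described in the commit above could look something like this. `buildRenderBody` is a hypothetical name; the field names are the ones listed in the PR summary. The commit only states that `userAgent` is omitted when undefined, so applying the same rule to the other optional fields is an assumption.

```javascript
// Optional sidecar fields named in the PR summary.
const OPTIONAL_FIELDS = ["waitFor", "waitTimeoutMs", "mobileUa", "userAgent"];

// Hypothetical helper: build the JSON body sent to the Playwright sidecar.
function buildRenderBody(url, opts = {}) {
  const body = { url };
  for (const key of OPTIONAL_FIELDS) {
    // Skip unset fields entirely so an older sidecar sees the same
    // payload shape it always received.
    if (opts[key] !== undefined) body[key] = opts[key];
  }
  return body;
}
```

For example, `buildRenderBody("https://example.com/a", { userAgent: "UA/1.0" })` yields an object with only `url` and `userAgent` keys.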
…coded UA)

Add optional `userAgent` field to RenderRequest and `_render`. Desktop branch
uses `user_agent or USER_AGENT`; mobile branch is untouched — the iPhone
device profile always wins there.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gments)

The recipe pattern was '/*/issues/*' which only matches '/foo/issues/N',
not GitHub's actual URL structure '/org/repo/issues/N' (3 segments). The
recipe never applied to real GitHub URLs, so issue conversions silently
fell through to the readability path with comments left JS-rendered and
absent from the markdown.

Reproduced against test instance with github.com//issues/10:
cache row had source=readability, indicating recipe wasn't matched.
Added a matcher test pinning the real-world path shape.
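The segment-count mismatch behind this fix can be illustrated with a small matcher. This is a sketch under the assumption that `*` matches exactly one path segment, which is consistent with the behaviour described above; `globToRegExp` and `matchesPath` are hypothetical names, not the functions in lib/recipes.js.

```javascript
// Assumption: `*` matches exactly one non-empty path segment (no slashes).
function globToRegExp(pattern) {
  const source = pattern
    .split("*")
    // Escape regex metacharacters in the literal parts of the pattern.
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join("[^/]+");
  return new RegExp(`^${source}$`);
}

function matchesPath(pattern, path) {
  return globToRegExp(pattern).test(path);
}

// The old pattern matches a two-part prefix only:
console.log(matchesPath("/*/issues/*", "/foo/issues/10"));        // true
console.log(matchesPath("/*/issues/*", "/org/repo/issues/10"));   // false
// The corrected pattern matches the org/repo shape:
console.log(matchesPath("/*/*/issues/*", "/org/repo/issues/10")); // true
```

Because the whole path is anchored with `^` and `$`, a pattern with too few wildcards fails outright rather than matching a prefix, which is why the original recipe never applied.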
@syswave-dev force-pushed the feat/recipe-engine-v2.2 branch from 9db0053 to 04494e0 on May 6, 2026 16:57
@syswave-dev merged commit 91b69fe into main on May 6, 2026
3 checks passed
@syswave-dev deleted the feat/recipe-engine-v2.2 branch on May 6, 2026 17:08
Development

Successfully merging this pull request may close these issues.

Architecture: Site recipe engine for per-site extraction rules