diff --git a/CHANGELOG.md b/CHANGELOG.md
index f1c3c5e..4b737c8 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,42 @@
 # Changelog
+## v2.2.0 — 2026-05-XX
+
+### Added
+
+- **Site Recipe Engine** (#18). Declarative `site-recipes.json` for per-host preprocess, fetch, select, and extractor rules. Default recipes ship in the repo (`site-recipes.default.json`); self-hosters can mount `data/site-recipes.json` or set `PULLMD_SITE_RECIPES` to point elsewhere. Four recipe categories:
+  - `preprocess` — DOM cleanup actions (`remove-attr`, `remove-class`, `remove-element`, `unwrap`) applied before extraction
+  - `fetch` — render forcing (`render: force|skip`), wait-for selector, mobile UA
+  - `select` — extra remove-selectors added to `cleanDom`
+  - `extractor` — preferred extractor per host (`readability`, `trafilatura`, `playwright`)
+- New endpoint `GET /api/recipes/status` (public, no auth) — counts loaded/rejected recipes per source for monitoring.
+- Cache invalidation on recipe change. When recipe content changes between server boots, all cache rows become stale and re-extract on next access (lazy, on-demand).
+- Playwright sidecar accepts new optional fields: `waitFor` (CSS selector), `waitTimeoutMs` (capped at 15000), `mobileUa` (boolean). Backwards compatible — requests that omit the new fields behave exactly as before.
+- Initial default recipes covering Future PLC sites (paywall + recommendation widgets) and GitHub Issues (JS-rendered comments).
+- The Playwright sidecar bundles `playwright-stealth` to mitigate `navigator.webdriver`-style headless detection on JS-driven anti-bot pages.
+
+### Known limitations
+
+- **Sites behind cookie-based consent walls** (third-party CMP frameworks like TCF v2) are not unlocked by recipes alone in this release. Such sites redirect non-consenting visitors to a JS-rendered consent UI and only return article content once HttpOnly cookies are set after a click.
+  A future release will add a `fetch.cookies` recipe field so operators can paste their own consent state when they choose to. For now, write a custom recipe with whatever combination of `select.remove`, `extractor`, and `fetch` settings works for your specific source — the engine supports the experimentation, the defaults stay conservative.
+
+### Important — `:latest` tag stays on v1.x
+
+The `:latest` tag on Docker Hub and GHCR remains pinned to v1.x until the scheduled flip on 2026-05-16. Self-hosters wanting the recipe engine **must pin `:2.2.0`** (or `:2.2`) explicitly for both `pullmd` and `pullmd-playwright`. Pulling `:latest` continues to give you v1, **without** the recipe engine.
+
+```yaml
+services:
+  pullmd:
+    image: aeternalabshq/pullmd:2.2.0
+  playwright:
+    image: aeternalabshq/pullmd-playwright:2.2.0
+```
+
+### Migration
+
+- New `meta` table created automatically on first boot. No action required.
+- Existing cache rows remain valid until the first recipe content change is detected.
+- See `MIGRATION.md` for the full upgrade path.
+
 ## v2.1.0 — 2026-05-05
 
 ### Added
diff --git a/MIGRATION.md b/MIGRATION.md
index 879659a..3f05597 100644
--- a/MIGRATION.md
+++ b/MIGRATION.md
@@ -69,3 +69,68 @@ If something goes wrong:
 4. Restart. The `users`/`sessions`/`api_keys`/`user_fetches` tables and the `user_id` column on `conversions` are unused by v1.x and can stay if you're not restoring; v1.x ignores them.
+
+# Migrating from v2.1.x to v2.2.0
+
+v2.2.0 ships the Site Recipe Engine (#18). This is a purely additive change — existing instances keep working unchanged. This section covers what to know if you want to use recipes.
+
+## Pin v2 tags explicitly
+
+`:latest` stays on v1.x until 2026-05-16.
Update your compose / k8s manifests:
+
+```yaml
+# Before
+image: aeternalabshq/pullmd:latest
+# After
+image: aeternalabshq/pullmd:2.2.0
+# Also bump the playwright sidecar — wait_for and mobile_ua need the new sidecar:
+image: aeternalabshq/pullmd-playwright:2.2.0
+```
+
+## Optional: mount user recipes
+
+The default recipes in `site-recipes.default.json` cover Future PLC sites and GitHub Issues out of the box. To add your own:
+
+```yaml
+services:
+  pullmd:
+    image: aeternalabshq/pullmd:2.2.0
+    volumes:
+      - ./data:/app/data
+    # Drop your custom recipes at ./data/site-recipes.json on the host;
+    # PullMD auto-discovers it. Or set PULLMD_SITE_RECIPES to a different path:
+    environment:
+      - PULLMD_SITE_RECIPES=/path/to/your/recipes.json
+```
+
+User recipes are concatenated with the defaults. On scalar conflicts (e.g. both define `extractor` for the same host), the user file wins via ordering.
+
+## Schema migrations
+
+The `meta` table is created automatically on first boot — no manual SQL. Existing cache rows remain valid until the first recipe content change is detected (a SHA-256 hash of the recipe file content is computed at boot; on change, `recipes_invalidated_at` is bumped and old cache rows lazy-refresh on next access).
+
+## Monitoring
+
+`GET /api/recipes/status` returns `{ ok, loaded, rejected, sources }` — public, no auth. Add it to UptimeKuma / Healthchecks / equivalent to be alerted when a recipe fails to parse:
+
+```json
+{
+  "ok": true,
+  "loaded": 5,
+  "rejected": 0,
+  "sources": [
+    { "path": "site-recipes.default.json", "loaded": 4, "rejected": 0 },
+    { "path": "/app/data/site-recipes.json", "loaded": 1, "rejected": 0 }
+  ]
+}
+```
+
+`ok = (rejected === 0)`. HTTP always returns 200; use the `ok` field for monitoring decisions. Rejection details are in stderr at server start (`docker logs pullmd | grep recipes`).
+
+## Rolling back to v2.1.x
+
+The schema change is additive (new `meta` table, no column changes on existing tables). To roll back:
+
+1. Stop the v2.2.0 container.
+2. Pin to `aeternalabshq/pullmd:2.1.0`.
+3. Restart. The `meta` table stays — v2.1.x ignores it.
diff --git a/lib/cache.js b/lib/cache.js
index 8097d83..5e8c93f 100644
--- a/lib/cache.js
+++ b/lib/cache.js
@@ -95,6 +95,14 @@ export function createCache(dbPath = '/data/cache.db') {
   db.exec(`CREATE INDEX IF NOT EXISTS idx_user_fetches_fetched_at ON user_fetches(fetched_at)`);
   db.exec(`CREATE UNIQUE INDEX IF NOT EXISTS idx_user_fetches_unique ON user_fetches(user_id, cache_id)`);
 
+  db.exec(`
+    CREATE TABLE IF NOT EXISTS meta (
+      key TEXT PRIMARY KEY,
+      value TEXT NOT NULL,
+      updated_at TEXT DEFAULT (datetime('now'))
+    )
+  `);
+
   // Migrate: add share_id column if missing
   const cols = db.prepare("PRAGMA table_info(conversions)").all().map(c => c.name);
   if (!cols.includes('share_id')) {
@@ -109,6 +117,8 @@ export function createCache(dbPath = '/data/cache.db') {
     db.exec('CREATE INDEX IF NOT EXISTS idx_conversions_user_id ON conversions(user_id)');
   }
 
+  let recipesInvalidatedAt = '1970-01-01 00:00:00';
+
   const stmts = {
     upsert: db.prepare(`
       INSERT INTO conversions (url, title, markdown, source, share_id, client, user_id, created_at)
@@ -124,7 +134,9 @@
     `),
     get: db.prepare(`
       SELECT title, markdown, source, share_id, client, created_at FROM conversions
-      WHERE url = ? AND created_at > datetime('now', '-1 hour')
+      WHERE url = ?
+        AND created_at > datetime('now', '-1 hour')
+        AND created_at > ?
     `),
     getByShareId: db.prepare(`
       SELECT url, title, markdown, source, client, created_at FROM conversions
@@ -192,6 +204,11 @@
       LIMIT ? OFFSET ?
    `),
    countForUser: db.prepare(`SELECT COUNT(*) as total FROM user_fetches WHERE user_id = ?`),
+    metaGet: db.prepare(`SELECT value FROM meta WHERE key = ?`),
+    metaSet: db.prepare(`
+      INSERT INTO meta (key, value, updated_at) VALUES (?, ?, datetime('now'))
+      ON CONFLICT(key) DO UPDATE SET value = excluded.value, updated_at = datetime('now')
+    `),
  };

  return {
@@ -214,7 +231,8 @@
    },

    get(url) {
-      return stmts.get.get(url) || null;
+      const row = stmts.get.get(url, recipesInvalidatedAt);
+      return row || null;
    },

    getByShareId(shareId) {
@@ -305,5 +323,17 @@
      }));
      return { total, window, bySource, lowQualityDomains, fallbackByDomain };
    },
+
+    getMeta(key) {
+      const row = stmts.metaGet.get(key);
+      return row ? row.value : null;
+    },
+    setMeta(key, value) {
+      stmts.metaSet.run(key, value);
+    },
+    setRecipesInvalidatedAt(iso) {
+      recipesInvalidatedAt = iso;
+      stmts.metaSet.run('recipes_invalidated_at', iso);
+    },
  };
}
diff --git a/lib/playwright-client.js b/lib/playwright-client.js
index 3ed356d..0b91228 100644
--- a/lib/playwright-client.js
+++ b/lib/playwright-client.js
@@ -9,7 +9,7 @@ const SIDECAR_TIMEOUT_MS = 25_000;
  * @param {typeof fetch} [opts.fetch] Injectable for tests
  * @returns {Promise<string>} rendered HTML
  */
-export async function renderViaSidecar(url, { signal, fetch: fetchFn = globalThis.fetch } = {}) {
+export async function renderViaSidecar(url, { signal, fetch: fetchFn = globalThis.fetch, waitFor, waitTimeoutMs, mobileUa, userAgent } = {}) {
  if (!process.env.PLAYWRIGHT_URL) throw new Error('Playwright sidecar not configured (PLAYWRIGHT_URL env)');

  const ctrl = new AbortController();
@@ -20,11 +20,17 @@
    else signal.addEventListener('abort', onAbort, { once: true });
  }

+  const body = { url };
+  if (waitFor !== undefined) body.waitFor = waitFor;
+  if (waitTimeoutMs !==
undefined) body.waitTimeoutMs = waitTimeoutMs;
+  if (mobileUa !== undefined) body.mobileUa = mobileUa;
+  if (userAgent !== undefined) body.userAgent = userAgent;
+
  try {
    const res = await fetchFn(process.env.PLAYWRIGHT_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
-      body: JSON.stringify({ url }),
+      body: JSON.stringify(body),
      signal: ctrl.signal,
    });
    if (!res.ok) throw new Error(`Sidecar returned ${res.status}`);
diff --git a/lib/recipes.js b/lib/recipes.js
new file mode 100644
index 0000000..75600dc
--- /dev/null
+++ b/lib/recipes.js
@@ -0,0 +1,254 @@
+import { z } from 'zod';
+import fs from 'node:fs';
+import path from 'node:path';
+import { createHash } from 'node:crypto';
+import * as cheerio from 'cheerio';
+
+const ActionSchema = z.discriminatedUnion('action', [
+  z.object({ action: z.literal('remove-attr'), selector: z.string().min(1), attr: z.string().min(1) }),
+  z.object({ action: z.literal('remove-class'), selector: z.string().min(1), class: z.string().min(1) }),
+  z.object({ action: z.literal('remove-element'), selector: z.string().min(1) }),
+  z.object({ action: z.literal('unwrap'), selector: z.string().min(1) }),
+]);
+
+const FetchSchema = z.object({
+  render: z.enum(['force', 'skip']).optional(),
+  wait_for: z.string().min(1).optional(),
+  wait_timeout_ms: z.number().int().min(0).max(15000).optional(),
+  mobile_ua: z.boolean().optional(),
+}).strict();
+
+const SelectSchema = z.object({
+  remove: z.array(z.string().min(1)).default([]),
+}).strict();
+
+export const RecipeSchema = z.object({
+  name: z.string().min(1),
+  host: z.union([z.string().min(1), z.array(z.string().min(1)).min(1)]),
+  path: z.string().min(1).default('/**'),
+  preprocess: z.array(ActionSchema).default([]),
+  select: SelectSchema.default({ remove: [] }),
+  extractor: z.enum(['readability', 'trafilatura', 'playwright']).optional(),
+  fetch: FetchSchema.default({}),
+}).strict();
+
+let cachedState = null;
+
+function loadOneFile(filePath) {
+  if
(!filePath || !fs.existsSync(filePath)) {
+    return { loaded: [], rejected: [], present: false };
+  }
+  let raw;
+  try {
+    raw = fs.readFileSync(filePath, 'utf8');
+  } catch (err) {
+    console.warn(`[recipes] cannot read ${filePath}: ${err.message}`);
+    return { loaded: [], rejected: [], present: true };
+  }
+  let parsed;
+  try {
+    parsed = JSON.parse(raw);
+  } catch (err) {
+    console.warn(`[recipes] ${filePath} is not valid JSON: ${err.message}`);
+    return { loaded: [], rejected: [], present: true };
+  }
+  if (!Array.isArray(parsed)) {
+    console.warn(`[recipes] ${filePath} root must be an array`);
+    return { loaded: [], rejected: [], present: true };
+  }
+
+  const loaded = [];
+  const rejected = [];
+  const seenNames = new Set();
+  parsed.forEach((entry, index) => {
+    const result = RecipeSchema.safeParse(entry);
+    if (!result.success) {
+      const msg = result.error.issues
+        .map((i) => `${i.path.join('.')}: ${i.message}`)
+        .join('; ');
+      console.warn(`[recipes] ${filePath} — recipe #${index} rejected: ${msg}`);
+      rejected.push({ index, name: entry?.name ?? null, message: msg });
+      return;
+    }
+    if (seenNames.has(result.data.name)) {
+      console.warn(`[recipes] ${filePath} — duplicate name "${result.data.name}", later entry wins`);
+      const existingIdx = loaded.findIndex((r) => r.name === result.data.name);
+      if (existingIdx >= 0) loaded.splice(existingIdx, 1);
+    }
+    seenNames.add(result.data.name);
+    loaded.push(result.data);
+  });
+  return { loaded, rejected, present: true };
+}
+
+function resolveUserPath() {
+  const env = process.env.PULLMD_SITE_RECIPES;
+  if (env) return env; // explicit always wins
+  const auto = path.resolve(process.cwd(), 'data/site-recipes.json');
+  return fs.existsSync(auto) ? auto : null;
+}
+
+export function loadRecipes(opts = {}) {
+  const defaultPath = opts.defaultPath ?? path.resolve(process.cwd(), 'site-recipes.default.json');
+  const userPath = opts.userPath ??
resolveUserPath();
+
+  const sources = [];
+  let allLoaded = [];
+  let totalRejected = 0;
+
+  for (const filePath of [defaultPath, userPath]) {
+    if (!filePath) continue;
+    const { loaded, rejected, present } = loadOneFile(filePath);
+    if (!present) continue;
+    sources.push({ path: filePath, loaded: loaded.length, rejected: rejected.length });
+    allLoaded = allLoaded.concat(loaded);
+    totalRejected += rejected.length;
+    console.log(`[recipes] loaded ${filePath}: ${loaded.length} ok, ${rejected.length} rejected`);
+  }
+
+  cachedState = {
+    recipes: allLoaded,
+    status: {
+      loaded: allLoaded.length,
+      rejected: totalRejected,
+      sources,
+    },
+  };
+  return cachedState;
+}
+
+export function getRecipeStatus() {
+  if (!cachedState) return { loaded: 0, rejected: 0, sources: [] };
+  return cachedState.status;
+}
+
+function globToRegex(glob) {
+  // Escape every regex-special char EXCEPT '*'; then translate '*' to '.*'.
+  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&').replace(/\*/g, '.*');
+  return new RegExp('^' + escaped + '$', 'i');
+}
+
+export function hostMatches(pattern, host) {
+  const patterns = Array.isArray(pattern) ? pattern : [pattern];
+  return patterns.some((p) => globToRegex(p).test(host));
+}
+
+function pathGlobToRegex(glob) {
+  // Translate ** before *, escape regex-specials in between.
+  // Strategy: walk char-by-char, recognize ** and * tokens, escape literals.
+  let result = '';
+  let i = 0;
+  while (i < glob.length) {
+    if (glob[i] === '*' && glob[i + 1] === '*') {
+      result += '.*';
+      i += 2;
+    } else if (glob[i] === '*') {
+      result += '[^/]+';
+      i += 1;
+    } else {
+      result += glob[i].replace(/[.+?^${}()|[\]\\]/g, '\\$&');
+      i += 1;
+    }
+  }
+  return new RegExp('^' + result + '$');
+}
+
+export function pathMatches(pattern, urlPath) {
+  return pathGlobToRegex(pattern).test(urlPath);
+}
+
+export function mergeRecipes(recipes) {
+  const result = {
+    preprocess: [],
+    removeSelectors: [],
+    extractor: undefined,
+    fetch: {},
+  };
+  for (const r of recipes) {
+    result.preprocess = result.preprocess.concat(r.preprocess || []);
+    result.removeSelectors = result.removeSelectors.concat(r.select?.remove || []);
+    if (r.extractor !== undefined) result.extractor = r.extractor;
+    if (r.fetch) {
+      for (const key of ['render', 'wait_for', 'wait_timeout_ms', 'mobile_ua']) {
+        if (r.fetch[key] !== undefined) result.fetch[key] = r.fetch[key];
+      }
+    }
+  }
+  return result;
+}
+
+export function matchRecipesAgainst(recipes, url) {
+  const host = url.hostname;
+  const urlPath = url.pathname || '/';
+  const matched = recipes.filter(
+    (r) => hostMatches(r.host, host) && pathMatches(r.path || '/**', urlPath),
+  );
+  return mergeRecipes(matched);
+}
+
+export function matchRecipes(url) {
+  if (!cachedState) return mergeRecipes([]);
+  return matchRecipesAgainst(cachedState.recipes, url);
+}
+
+export function computeRecipesHash(filePaths) {
+  const hash = createHash('sha256');
+  for (const p of filePaths) {
+    if (!p) continue;
+    if (fs.existsSync(p)) {
+      hash.update(p, 'utf8');
+      hash.update('\n', 'utf8');
+      hash.update(fs.readFileSync(p));
+      hash.update('\n', 'utf8');
+    }
+  }
+  return hash.digest('hex');
+}
+
+export function applyRecipesInvalidation(cache, newHash) {
+  const oldHash = cache.getMeta('recipes_hash');
+  if (oldHash !== newHash) {
+    if (oldHash !== null) {
+      // Hash truly changed across reboots — bump invalidation timestamp.
+      // First boot (oldHash === null) does NOT bump: existing cache rows stay valid
+      // until the operator actually changes recipes.
+      const now = new Date().toISOString().replace('T', ' ').slice(0, 19);
+      cache.setRecipesInvalidatedAt(now);
+    }
+    cache.setMeta('recipes_hash', newHash);
+  }
+}
+
+export function applyPreprocessActions(html, actions) {
+  if (!html || typeof html !== 'string') return html;
+  if (!actions || actions.length === 0) return html;
+
+  const $ = cheerio.load(html, { decodeEntities: false });
+  for (const action of actions) {
+    switch (action.action) {
+      case 'remove-attr':
+        $(action.selector).removeAttr(action.attr);
+        break;
+      case 'remove-class':
+        $(action.selector).each((_, el) => {
+          const $el = $(el);
+          const cls = $el.attr('class');
+          if (!cls) return;
+          const tokens = cls.split(/\s+/).filter((t) => t && t !== action.class);
+          if (tokens.length === 0) $el.removeAttr('class');
+          else $el.attr('class', tokens.join(' '));
+        });
+        break;
+      case 'remove-element':
+        $(action.selector).remove();
+        break;
+      case 'unwrap':
+        $(action.selector).each((_, el) => {
+          const $el = $(el);
+          $el.replaceWith($el.contents());
+        });
+        break;
+    }
+  }
+  return $.html();
+}
diff --git a/lib/web.js b/lib/web.js
index 8cde41d..d91fa27 100644
--- a/lib/web.js
+++ b/lib/web.js
@@ -7,6 +7,7 @@ import { renderDecision } from './render-decision.js';
 import { renderViaSidecar } from './playwright-client.js';
 import { pickUserAgent, maybeRefreshUaPool } from './user-agent.js';
 import { preprocess } from './preprocess.js';
+import { matchRecipes, matchRecipesAgainst, applyPreprocessActions } from './recipes.js';
 
 const TRAFILATURA_URL = process.env.TRAFILATURA_URL;
 const TRAFILATURA_TIMEOUT_MS = 8_000;
@@ -157,8 +158,11 @@ const REMOVE_SELECTORS = [
 
 // Strict UUID v4 — used to detect CMS-asset-ID leakage in alt text.
 const UUID_ALT_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
 
-function cleanDom(document) {
-  [...document.querySelectorAll(REMOVE_SELECTORS)].forEach(el => el.remove());
+function cleanDom(document, extraRemoveSelectors = []) {
+  const allRemove = extraRemoveSelectors.length > 0
+    ? REMOVE_SELECTORS + ', ' + extraRemoveSelectors.join(', ')
+    : REMOVE_SELECTORS;
+  [...document.querySelectorAll(allRemove)].forEach(el => el.remove());
 
   // Surface readonly text-input values so click-to-copy slugs
   // (API model names, embed snippets, share links, …) survive extraction.
@@ -181,8 +185,11 @@
   }
 }
 
-async function convertWithReadability(url, html, comments, statusCode, fetchFn, extractor) {
-  const cleanedHtml = preprocess(html);
+async function convertWithReadability(url, html, comments, statusCode, fetchFn, extractor, recipe) {
+  let cleanedHtml = preprocess(html);
+  if (recipe?.preprocess?.length) {
+    cleanedHtml = applyPreprocessActions(cleanedHtml, recipe.preprocess);
+  }
   const { document } = parseHTML(cleanedHtml);
 
   const title = document.querySelector('title')?.textContent?.trim() || new URL(url).hostname;
@@ -192,7 +199,7 @@
   metadata.sourceUrl = url;
   metadata.statusCode = statusCode;
 
-  cleanDom(document);
+  cleanDom(document, recipe?.removeSelectors || []);
 
   // Comments path: skip Readability and Trafilatura, use cleaned body
   if (comments) {
@@ -211,7 +218,7 @@
     readabilityMd = nhm.translate(article.content);
   } else {
     const { document: doc2 } = parseHTML(cleanedHtml);
-    cleanDom(doc2);
+    cleanDom(doc2, recipe?.removeSelectors || []);
     readabilityMd = nhm.translate(doc2.querySelector('body')?.innerHTML || cleanedHtml);
     readabilityFellBack = true;
   }
@@ -291,8 +298,19 @@ export async function extractWeb(url, options = {}) {
     extractor,      // 'readability' | 'trafilatura' | 'playwright' | undefined
     renderClient = renderViaSidecar,  // injectable for tests
   } = options;
-  // extractor=playwright implies forced render — Playwright must run.
-  const effectiveRender = extractor === 'playwright' ? 'force' : render;
+
+  // Resolve recipes: tests may pass options.recipes directly; production reads
+  // from the module-level cache populated by loadRecipes() at server boot.
+  const recipe = options.recipes
+    ? matchRecipesAgainst(options.recipes, new URL(url))
+    : matchRecipes(new URL(url));
+
+  // Hook 1: the query param beats the recipe; the recipe beats the built-in default.
+  const queryRender = (render === 'force' || render === 'skip') ? render : undefined;
+  const effectiveExtractor = extractor || recipe.extractor;
+  const effectiveRender = effectiveExtractor === 'playwright'
+    ? 'force'
+    : (queryRender ?? recipe.fetch.render);
 
   const rawFetchFn = options.fetch || globalThis.fetch;
   const fetchFn = withTimeout(rawFetchFn);
@@ -300,7 +318,8 @@
   // TTL check; the actual refresh (if any) does not block this request.
   void maybeRefreshUaPool();
 
-  const headers = { 'User-Agent': pickUserAgent() };
+  const userAgent = pickUserAgent();
+  const headers = { 'User-Agent': userAgent };
   if (!comments) {
     headers['Accept'] = 'text/markdown, text/html;q=0.9, */*;q=0.8';
   }
@@ -335,7 +354,7 @@
   }
 
   // First pass: static extraction (Readability + Trafilatura + pickBest)
-  const result = await convertWithReadability(url, body, comments, statusCode, rawFetchFn, extractor);
+  const result = await convertWithReadability(url, body, comments, statusCode, rawFetchFn, effectiveExtractor, recipe);
   emit('extracting', { source: result.source });
 
   // Decide whether to render via Playwright sidecar
@@ -346,10 +365,16 @@
 
   // Second pass: render via sidecar, re-extract on rendered HTML
   try {
-    const renderedHtml = await renderClient(url, { signal });
-    const rendered = await convertWithReadability(url, renderedHtml, comments, statusCode, rawFetchFn, extractor);
+    const renderedHtml = await renderClient(url, {
+      signal,
+      waitFor: recipe?.fetch?.wait_for,
+      waitTimeoutMs: recipe?.fetch?.wait_timeout_ms,
+      mobileUa: recipe?.fetch?.mobile_ua,
+      userAgent,
+    });
+    const rendered = await convertWithReadability(url, renderedHtml, comments, statusCode, rawFetchFn, effectiveExtractor, recipe);
     rendered.source = 'playwright';
-    const renderReason = extractor === 'playwright'
+    const renderReason = effectiveExtractor === 'playwright'
       ? 'forced via extractor=playwright'
       : `${decision.reason} → rendered via playwright`;
     rendered.metadata.extractorReason = renderReason;
diff --git a/package-lock.json b/package-lock.json
index 0ae4111..1bbf341 100644
--- a/package-lock.json
+++ b/package-lock.json
@@ -1,12 +1,12 @@
 {
   "name": "pullmd",
-  "version": "2.0.0",
+  "version": "2.2.0",
   "lockfileVersion": 3,
   "requires": true,
   "packages": {
     "": {
       "name": "pullmd",
-      "version": "2.0.0",
+      "version": "2.2.0",
       "license": "AGPL-3.0-or-later",
       "dependencies": {
         "@modelcontextprotocol/sdk": "^1.29.0",
diff --git a/package.json b/package.json
index 586d626..5dfb742 100644
--- a/package.json
+++ b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "pullmd",
-  "version": "2.1.0",
+  "version": "2.2.0",
   "type": "module",
   "main": "server.js",
   "license": "AGPL-3.0-or-later",
diff --git a/playwright-sidecar/app.py b/playwright-sidecar/app.py
index 6b1948c..67f5f5f 100644
--- a/playwright-sidecar/app.py
+++ b/playwright-sidecar/app.py
@@ -26,6 +26,22 @@ log = logging.getLogger("playwright-sidecar")
 
 state: dict = {"browser": None, "pw": None, "sem": asyncio.Semaphore(MAX_CONCURRENCY)}
 
+# Stealth: defeat navigator.webdriver and other headless markers.
+# The API has changed across versions; try the current modern entrypoint with a fallback.
+try:
+    from playwright_stealth import Stealth as _Stealth
+    _stealth = _Stealth()
+
+    async def _apply_stealth(page):
+        # Modern API (>= 2.x): instance method on Stealth
+        await _stealth.apply_stealth_async(page)
+except (ImportError, AttributeError):
+    try:
+        from playwright_stealth import stealth_async as _apply_stealth  # legacy 1.x
+    except ImportError:
+        async def _apply_stealth(page):
+            pass
+        log.warning("playwright-stealth not installed; running without bot-detection mitigation")
+
 
 @asynccontextmanager
 async def lifespan(_app: FastAPI):
@@ -44,6 +60,10 @@
 
 class RenderRequest(BaseModel):
     url: str
+    waitFor: str | None = None
+    waitTimeoutMs: int | None = None
+    mobileUa: bool = False
+    userAgent: str | None = None
 
 
 @app.get("/health")
@@ -56,15 +76,48 @@ def health():
     }
 
 
-async def _render(url: str) -> str:
-    context = await state["browser"].new_context(user_agent=USER_AGENT)
+async def _render(url: str, wait_for: str | None = None, wait_timeout_ms: int | None = None, mobile_ua: bool = False, user_agent: str | None = None) -> str:
+    if mobile_ua:
+        device = state["pw"].devices.get("iPhone 13")
+        if device is None:
+            # Fallback: hand-crafted mobile context if the device profile is unavailable
+            context = await state["browser"].new_context(
+                user_agent=(
+                    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
+                    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1"
+                ),
+                viewport={"width": 390, "height": 844},
+                device_scale_factor=3,
+                is_mobile=True,
+                has_touch=True,
+            )
+        else:
+            context = await state["browser"].new_context(**device)
+    else:
+        context = await state["browser"].new_context(user_agent=user_agent or USER_AGENT)
+
     try:
         page = await context.new_page()
-        await page.goto(url, wait_until="domcontentloaded", timeout=NAV_TIMEOUT_MS)
         try:
-            await page.wait_for_load_state("networkidle", timeout=NETWORKIDLE_TIMEOUT_MS)
-        except PWTimeout:
-            log.info("networkidle timeout, returning current DOM: %s", url)
+            await _apply_stealth(page)
+        except Exception as e:
+            log.warning("stealth apply failed (non-fatal): %s", e)
+        await page.goto(url, wait_until="domcontentloaded", timeout=NAV_TIMEOUT_MS)
+
+        if wait_for:
+            # Recipe-driven: wait for a specific selector instead of networkidle
+            timeout = max(0, min(wait_timeout_ms or 5000, 15_000))
+            try:
+                await page.wait_for_selector(wait_for, timeout=timeout)
+            except PWTimeout:
+                log.info("wait_for selector timeout, returning current DOM: %s (selector=%s)", url, wait_for)
+        else:
+            # Default behavior: wait for networkidle as before
+            try:
+                await page.wait_for_load_state("networkidle", timeout=NETWORKIDLE_TIMEOUT_MS)
+            except PWTimeout:
+                log.info("networkidle timeout, returning current DOM: %s", url)
+
         return await page.content()
     finally:
         await context.close()
@@ -81,7 +134,10 @@ async def render(req: RenderRequest):
 
     async with sem:
         try:
-            return await asyncio.wait_for(_render(req.url), timeout=HARD_TIMEOUT_S)
+            return await asyncio.wait_for(
+                _render(req.url, wait_for=req.waitFor, wait_timeout_ms=req.waitTimeoutMs, mobile_ua=req.mobileUa, user_agent=req.userAgent),
+                timeout=HARD_TIMEOUT_S,
+            )
         except asyncio.TimeoutError:
             raise HTTPException(status_code=504, detail=f"render timeout after {HARD_TIMEOUT_S}s")
         except Exception as exc:
diff --git a/playwright-sidecar/requirements.txt b/playwright-sidecar/requirements.txt
index 1e9abbc..9ce42cf 100644
--- a/playwright-sidecar/requirements.txt
+++ b/playwright-sidecar/requirements.txt
@@ -1,3 +1,4 @@
 fastapi==0.115.0
 uvicorn[standard]==0.32.0
 playwright==1.49.0
+playwright-stealth
diff --git a/server.js b/server.js
index c97b966..eb0af65 100644
--- a/server.js
+++ b/server.js
@@ -7,6 +7,9 @@ import { qualityScore } from './lib/scoring.js';
 import { buildFrontmatter } from './lib/frontmatter.js';
 import { mcpHandler } from './lib/mcp.js';
 import { renderHelp, renderIndex, getSkillZip, publicUrlFor } from './lib/distrib.js';
+import { getRecipeStatus, loadRecipes,
applyRecipesInvalidation, computeRecipesHash } from './lib/recipes.js';
+import path from 'node:path';
+import fs from 'node:fs';
 
 function stripMarkdown(md) {
   return md
@@ -498,6 +501,17 @@ export function createApp(overrides = {}) {
     }
   });
 
+  app.get('/api/recipes/status', (req, res) => {
+    const status = getRecipeStatus();
+    const ok = status.rejected === 0;
+    res.json({
+      ok,
+      loaded: status.loaded,
+      rejected: status.rejected,
+      sources: status.sources,
+    });
+  });
+
   app.get('/api/stats', (req, res) => {
     if (!cache) return res.json({ total: 0, window: '-7 days' });
     const window = req.query.window || '-7 days';
@@ -585,6 +599,20 @@ if (isDirectRun || process.argv[1]?.endsWith('server.js')) {
     }
     throw err;
   }
+  // Load site recipes (default + optional user overlay)
+  const defaultRecipesPath = path.resolve(process.cwd(), 'site-recipes.default.json');
+  const userRecipesPath = process.env.PULLMD_SITE_RECIPES
+    || (fs.existsSync(path.resolve(process.cwd(), 'data/site-recipes.json'))
+      ? path.resolve(process.cwd(), 'data/site-recipes.json')
+      : null);
+  loadRecipes({ defaultPath: defaultRecipesPath, userPath: userRecipesPath });
+
+  // Hash recipe content; if changed since last boot, invalidate cache.
+  const recipesHash = computeRecipesHash([defaultRecipesPath, userRecipesPath].filter(Boolean));
+  applyRecipesInvalidation(cache, recipesHash);
+  const invalidationStamp = cache.getMeta('recipes_invalidated_at');
+  if (invalidationStamp) cache.setRecipesInvalidatedAt(invalidationStamp);
+
   const app = createApp({ cache, auth });
   app.listen(port, () => {
     console.log(`PullMD running on http://localhost:${port} (auth: ${mode})`);
diff --git a/site-recipes.default.json b/site-recipes.default.json
new file mode 100644
index 0000000..700c98b
--- /dev/null
+++ b/site-recipes.default.json
@@ -0,0 +1,45 @@
+[
+  {
+    "name": "future-plc-paywall-aria",
+    "host": [
+      "*.windowscentral.com",
+      "*.gamesradar.com",
+      "*.techradar.com",
+      "*.tomshardware.com",
+      "*.pcgamer.com",
+      "*.t3.com"
+    ],
+    "preprocess": [
+      { "action": "remove-attr", "selector": "p[aria-hidden=\"true\"]", "attr": "aria-hidden" },
+      { "action": "remove-class", "selector": "p.paywall", "class": "paywall" }
+    ]
+  },
+  {
+    "name": "future-plc-recommendations",
+    "host": [
+      "*.windowscentral.com",
+      "*.gamesradar.com",
+      "*.techradar.com",
+      "*.tomshardware.com",
+      "*.pcgamer.com",
+      "*.t3.com"
+    ],
+    "select": {
+      "remove": [
+        "aside[class*=\"you-may-like\" i]",
+        "div.related-articles",
+        "[data-component=\"recommendations\"]"
+      ]
+    }
+  },
+  {
+    "name": "github-issues",
+    "host": "github.com",
+    "path": "/*/*/issues/*",
+    "fetch": {
+      "render": "force",
+      "wait_for": ".js-comment-body",
+      "wait_timeout_ms": 5000
+    }
+  }
+]
diff --git a/test/cache.test.js b/test/cache.test.js
index 4ee69ba..9444309 100644
--- a/test/cache.test.js
+++ b/test/cache.test.js
@@ -198,3 +198,52 @@ describe('cache', () => {
     });
   });
 });
+
+describe('cache — recipes invalidation in get()', () => {
+  it('returns null when row created_at < recipes_invalidated_at', () => {
+    const c = createCache(':memory:');
+    c.put({ url: 'https://x.com', title: 'T', markdown: '# T', source: 'readability' });
+    // Set invalidation timestamp AFTER the row was inserted
+    const future = new Date(Date.now() + 1000).toISOString().replace('T', ' ').slice(0, 19);
+    c.setRecipesInvalidatedAt(future);
+    assert.equal(c.get('https://x.com'), null);
+  });
+
+  it('still returns the row when invalidation timestamp is in the past', () => {
+    const c = createCache(':memory:');
+    c.setRecipesInvalidatedAt('1970-01-01 00:00:00');
+    c.put({ url: 'https://x.com', title: 'T', markdown: '# T', source: 'readability' });
+    const hit = c.get('https://x.com');
+    assert.ok(hit);
+    assert.equal(hit.title, 'T');
+  });
+
+  it('default (no setRecipesInvalidatedAt called) treats all rows as fresh re: recipes', () => {
+    const c = createCache(':memory:');
+    c.put({ url: 'https://x.com', title: 'T', markdown: '# T', source: 'readability' });
+    const hit = c.get('https://x.com');
+    assert.ok(hit);
+  });
+});
+
+describe('cache — meta table', () => {
+  it('creates the meta table on init', () => {
+    const c = createCache(':memory:');
+    assert.equal(c.getMeta('any-missing-key'), null);
+    c.setMeta('foo', 'bar');
+    assert.equal(c.getMeta('foo'), 'bar');
+  });
+
+  it('overwrites existing key on setMeta', () => {
+    const c = createCache(':memory:');
+    c.setMeta('foo', 'one');
+    c.setMeta('foo', 'two');
+    assert.equal(c.getMeta('foo'), 'two');
+  });
+
+  it('exposes setRecipesInvalidatedAt + reads it back via meta', () => {
+    const c = createCache(':memory:');
+    c.setRecipesInvalidatedAt('2026-05-06 12:00:00');
+    assert.equal(c.getMeta('recipes_invalidated_at'), '2026-05-06 12:00:00');
+  });
+});
diff --git a/test/fixtures/recipes/default.json b/test/fixtures/recipes/default.json
new file mode 100644
index 0000000..91e0528
--- /dev/null
+++ b/test/fixtures/recipes/default.json
@@ -0,0 +1,14 @@
+[
+  {
+    "name": "fixture-paywall",
+    "host": "*.example.com",
+    "preprocess": [
+      { "action": "remove-attr", "selector": "p[aria-hidden=\"true\"]", "attr": "aria-hidden" }
+    ]
+  },
+  {
+    "name": "fixture-extractor",
+    "host": "blog.example.com",
+    "extractor":
"trafilatura" + } +] diff --git a/test/fixtures/recipes/invalid.json b/test/fixtures/recipes/invalid.json new file mode 100644 index 0000000..03410a9 --- /dev/null +++ b/test/fixtures/recipes/invalid.json @@ -0,0 +1,4 @@ +[ + { "name": "valid-one", "host": "ok.example.com" }, + { "name": "invalid-one", "host": "bad.example.com", "preprocess": [{ "action": "acton", "selector": "p", "attr": "x" }] } +] diff --git a/test/fixtures/recipes/user.json b/test/fixtures/recipes/user.json new file mode 100644 index 0000000..7dfe3a1 --- /dev/null +++ b/test/fixtures/recipes/user.json @@ -0,0 +1,12 @@ +[ + { + "name": "fixture-extractor", + "host": "blog.example.com", + "extractor": "playwright" + }, + { + "name": "fixture-user-only", + "host": "user-only.example.com", + "select": { "remove": ["aside.ads"] } + } +] diff --git a/test/playwright-client.test.js b/test/playwright-client.test.js index 77e8746..9d54d6c 100644 --- a/test/playwright-client.test.js +++ b/test/playwright-client.test.js @@ -54,3 +54,52 @@ describe('renderViaSidecar', () => { ); }); }); + +describe('renderViaSidecar — recipe-driven options', () => { + it('forwards waitFor, waitTimeoutMs, mobileUa in POST body', async () => { + let captured; + const mockFetch = async (url, opts) => { + captured = JSON.parse(opts.body); + return { ok: true, text: async () => '' }; + }; + process.env.PLAYWRIGHT_URL = 'http://sidecar.test/'; + const { renderViaSidecar } = await import('../lib/playwright-client.js'); + await renderViaSidecar('https://example.com/', { + fetch: mockFetch, + waitFor: '.x', + waitTimeoutMs: 2500, + mobileUa: true, + }); + assert.equal(captured.url, 'https://example.com/'); + assert.equal(captured.waitFor, '.x'); + assert.equal(captured.waitTimeoutMs, 2500); + assert.equal(captured.mobileUa, true); + }); + + it('emits only url when no recipe options set (backwards compat)', async () => { + let captured; + const mockFetch = async (url, opts) => { + captured = JSON.parse(opts.body); + return { ok: 
true, text: async () => '' }; + }; + process.env.PLAYWRIGHT_URL = 'http://sidecar.test/'; + const { renderViaSidecar } = await import('../lib/playwright-client.js'); + await renderViaSidecar('https://example.com/', { fetch: mockFetch }); + assert.deepEqual(Object.keys(captured), ['url']); + }); + + it('forwards userAgent in POST body when set', async () => { + let captured; + const mockFetch = async (url, opts) => { + captured = JSON.parse(opts.body); + return { ok: true, text: async () => '' }; + }; + process.env.PLAYWRIGHT_URL = 'http://sidecar.test/'; + const { renderViaSidecar } = await import('../lib/playwright-client.js'); + await renderViaSidecar('https://example.com/', { + fetch: mockFetch, + userAgent: 'Mozilla/5.0 (Test) Test/1.0', + }); + assert.equal(captured.userAgent, 'Mozilla/5.0 (Test) Test/1.0'); + }); +}); diff --git a/test/recipes-actions.test.js b/test/recipes-actions.test.js new file mode 100644 index 0000000..6ea88e9 --- /dev/null +++ b/test/recipes-actions.test.js @@ -0,0 +1,94 @@ +import { describe, it } from 'node:test'; +import assert from 'node:assert/strict'; +import { applyPreprocessActions } from '../lib/recipes.js'; + +describe('applyPreprocessActions — remove-attr', () => { + it('removes the named attribute from matching elements', () => { + const html = ''; + const out = applyPreprocessActions(html, [ + { action: 'remove-attr', selector: 'p', attr: 'aria-hidden' }, + ]); + assert.equal(out.includes('aria-hidden'), false); + assert.ok(out.includes('

x

')); + }); + + it('leaves non-matching elements alone', () => { + const html = ''; + const out = applyPreprocessActions(html, [ + { action: 'remove-attr', selector: 'p', attr: 'aria-hidden' }, + ]); + assert.ok(out.includes('')); + }); +}); + +describe('applyPreprocessActions — remove-class', () => { + it('removes the named class token, preserving others', () => { + const html = '

x

'; + const out = applyPreprocessActions(html, [ + { action: 'remove-class', selector: 'p', class: 'paywall' }, + ]); + assert.ok(out.includes('class="foo bar"') || out.includes('class="foo bar"')); + assert.equal(out.includes('paywall'), false); + }); + + it('removes the class attribute entirely if the only token is removed', () => { + const html = '

x

'; + const out = applyPreprocessActions(html, [ + { action: 'remove-class', selector: 'p', class: 'paywall' }, + ]); + assert.equal(out.includes('class='), false); + }); +}); + +describe('applyPreprocessActions — remove-element', () => { + it('removes the matching element and its descendants', () => { + const html = '

keep

'; + const out = applyPreprocessActions(html, [ + { action: 'remove-element', selector: 'aside.ads' }, + ]); + assert.equal(out.includes('drop'), false); + assert.ok(out.includes('keep')); + }); +}); + +describe('applyPreprocessActions — unwrap', () => { + it('replaces element with its children', () => { + const html = '

hello world!

'; + const out = applyPreprocessActions(html, [ + { action: 'unwrap', selector: 'span.wrap' }, + ]); + assert.ok(out.includes('hello world!')); + assert.equal(out.includes(' { + it('returns original HTML when actions list is empty', () => { + const html = '

x

'; + assert.equal(applyPreprocessActions(html, []), html); + }); + + it('no-op when selector matches nothing', () => { + const html = '

x

'; + const out = applyPreprocessActions(html, [ + { action: 'remove-attr', selector: 'div', attr: 'foo' }, + ]); + assert.ok(out.includes('

x

')); + }); + + it('returns input unchanged when html is empty/null', () => { + assert.equal(applyPreprocessActions('', []), ''); + assert.equal(applyPreprocessActions(null, []), null); + }); + + it('applies multiple actions in order', () => { + const html = ''; + const out = applyPreprocessActions(html, [ + { action: 'remove-attr', selector: 'p', attr: 'aria-hidden' }, + { action: 'remove-class', selector: 'p', class: 'paywall' }, + ]); + assert.equal(out.includes('aria-hidden'), false); + assert.equal(out.includes('paywall'), false); + assert.ok(out.includes('foo')); + }); +}); diff --git a/test/recipes-cache-invalidation.test.js b/test/recipes-cache-invalidation.test.js new file mode 100644 index 0000000..369378b --- /dev/null +++ b/test/recipes-cache-invalidation.test.js @@ -0,0 +1,58 @@ +import { describe, it } from 'node:test'; +import assert from 'node:assert/strict'; +import { computeRecipesHash, applyRecipesInvalidation } from '../lib/recipes.js'; +import { createCache } from '../lib/cache.js'; +import { fileURLToPath } from 'node:url'; +import path from 'node:path'; + +const here = path.dirname(fileURLToPath(import.meta.url)); +const fix = (rel) => path.join(here, 'fixtures/recipes', rel); + +describe('computeRecipesHash', () => { + it('returns a stable hex string for the same content', () => { + const a = computeRecipesHash([fix('default.json')]); + const b = computeRecipesHash([fix('default.json')]); + assert.equal(a, b); + assert.match(a, /^[0-9a-f]{64}$/); + }); + + it('returns a different hash when content differs', () => { + const a = computeRecipesHash([fix('default.json')]); + const b = computeRecipesHash([fix('default.json'), fix('user.json')]); + assert.notEqual(a, b); + }); + + it('handles missing files gracefully (treats as empty)', () => { + const a = computeRecipesHash([fix('default.json'), fix('does-not-exist.json')]); + const b = computeRecipesHash([fix('default.json')]); + assert.equal(a, b); + }); +}); + 
+describe('applyRecipesInvalidation', () => { + it('first boot: stores hash, leaves recipes_invalidated_at unset', () => { + const c = createCache(':memory:'); + assert.equal(c.getMeta('recipes_hash'), null); + applyRecipesInvalidation(c, 'hash-A'); + assert.equal(c.getMeta('recipes_hash'), 'hash-A'); + // Spec: on first boot, no invalidation stamp written (existing cache rows stay valid) + assert.equal(c.getMeta('recipes_invalidated_at'), null); + }); + + it('reboot, hash unchanged: no invalidation timestamp update', () => { + const c = createCache(':memory:'); + applyRecipesInvalidation(c, 'hash-A'); // first boot + const stamp = c.getMeta('recipes_invalidated_at'); + applyRecipesInvalidation(c, 'hash-A'); // unchanged + assert.equal(c.getMeta('recipes_invalidated_at'), stamp); + }); + + it('reboot, hash changed: invalidation timestamp updates to NOW', () => { + const c = createCache(':memory:'); + applyRecipesInvalidation(c, 'hash-A'); // first boot, no stamp yet + applyRecipesInvalidation(c, 'hash-B'); // change! 
+ const stamp = c.getMeta('recipes_invalidated_at'); + assert.ok(stamp, 'invalidation stamp should be set'); + assert.match(stamp, /^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$/); + }); +}); diff --git a/test/recipes-loader.test.js b/test/recipes-loader.test.js new file mode 100644 index 0000000..04cdca2 --- /dev/null +++ b/test/recipes-loader.test.js @@ -0,0 +1,155 @@ +import { describe, it } from 'node:test'; +import assert from 'node:assert/strict'; +import { RecipeSchema } from '../lib/recipes.js'; +import { loadRecipes } from '../lib/recipes.js'; +import { fileURLToPath } from 'node:url'; +import path from 'node:path'; + +const here = path.dirname(fileURLToPath(import.meta.url)); +const fix = (rel) => path.join(here, 'fixtures/recipes', rel); + +describe('RecipeSchema', () => { + it('accepts a minimal recipe with name + host', () => { + const result = RecipeSchema.safeParse({ name: 'r1', host: 'example.com' }); + assert.equal(result.success, true); + }); + + it('accepts host as string array', () => { + const result = RecipeSchema.safeParse({ name: 'r1', host: ['a.com', 'b.com'] }); + assert.equal(result.success, true); + }); + + it('rejects when name is missing', () => { + const result = RecipeSchema.safeParse({ host: 'example.com' }); + assert.equal(result.success, false); + }); + + it('rejects when host is missing', () => { + const result = RecipeSchema.safeParse({ name: 'r1' }); + assert.equal(result.success, false); + }); + + it('accepts all four preprocess actions', () => { + const recipe = { + name: 'r1', host: 'a.com', + preprocess: [ + { action: 'remove-attr', selector: 'p', attr: 'aria-hidden' }, + { action: 'remove-class', selector: 'p', class: 'paywall' }, + { action: 'remove-element', selector: 'aside.ads' }, + { action: 'unwrap', selector: 'span.wrapper' }, + ], + }; + assert.equal(RecipeSchema.safeParse(recipe).success, true); + }); + + it('rejects unknown preprocess action', () => { + const recipe = { + name: 'r1', host: 'a.com', + preprocess: [{ 
action: 'acton', selector: 'p', attr: 'x' }], + }; + assert.equal(RecipeSchema.safeParse(recipe).success, false); + }); + + it('accepts fetch options', () => { + const recipe = { + name: 'r1', host: 'a.com', + fetch: { render: 'force', wait_for: '.x', wait_timeout_ms: 5000, mobile_ua: true }, + }; + assert.equal(RecipeSchema.safeParse(recipe).success, true); + }); + + it('rejects fetch.render outside the enum', () => { + const recipe = { name: 'r1', host: 'a.com', fetch: { render: 'auto' } }; + assert.equal(RecipeSchema.safeParse(recipe).success, false); + }); + + it('caps fetch.wait_timeout_ms at 15000', () => { + const recipe = { name: 'r1', host: 'a.com', fetch: { wait_timeout_ms: 99999 } }; + assert.equal(RecipeSchema.safeParse(recipe).success, false); + }); + + it('accepts select.remove as string array', () => { + const recipe = { name: 'r1', host: 'a.com', select: { remove: ['aside', '.ads'] } }; + assert.equal(RecipeSchema.safeParse(recipe).success, true); + }); + + it('accepts extractor enum', () => { + for (const x of ['readability', 'trafilatura', 'playwright']) { + assert.equal(RecipeSchema.safeParse({ name: 'r1', host: 'a.com', extractor: x }).success, true); + } + }); + + it('rejects unknown extractor', () => { + assert.equal( + RecipeSchema.safeParse({ name: 'r1', host: 'a.com', extractor: 'magic' }).success, + false, + ); + }); +}); + +describe('loadRecipes — default file only', () => { + it('loads recipes from the default file', () => { + const { recipes, status } = loadRecipes({ defaultPath: fix('default.json') }); + assert.equal(recipes.length, 2); + assert.equal(recipes[0].name, 'fixture-paywall'); + assert.equal(status.loaded, 2); + assert.equal(status.rejected, 0); + assert.equal(status.sources.length, 1); + assert.equal(status.sources[0].loaded, 2); + }); + + it('returns empty + warning when default file is absent', () => { + const { recipes, status } = loadRecipes({ defaultPath: fix('does-not-exist.json') }); + assert.equal(recipes.length, 
0); + assert.equal(status.loaded, 0); + assert.equal(status.sources.length, 0); + }); + + it('skips user file when not provided', () => { + const { status } = loadRecipes({ defaultPath: fix('default.json') }); + assert.equal(status.sources.length, 1); + }); +}); + +describe('loadRecipes — user overlay', () => { + it('loads default + user, concatenates in order', () => { + const { recipes } = loadRecipes({ + defaultPath: fix('default.json'), + userPath: fix('user.json'), + }); + assert.equal(recipes.length, 4); + assert.equal(recipes[0].name, 'fixture-paywall'); + assert.equal(recipes[1].name, 'fixture-extractor'); + assert.equal(recipes[2].name, 'fixture-extractor'); // user override (same name) + assert.equal(recipes[3].name, 'fixture-user-only'); + }); + + it('reports per-source counts in status', () => { + const { status } = loadRecipes({ + defaultPath: fix('default.json'), + userPath: fix('user.json'), + }); + assert.equal(status.sources.length, 2); + assert.equal(status.sources[0].loaded, 2); + assert.equal(status.sources[1].loaded, 2); + assert.equal(status.rejected, 0); + }); + + it('skips user file silently when absent', () => { + const { status } = loadRecipes({ + defaultPath: fix('default.json'), + userPath: fix('does-not-exist.json'), + }); + assert.equal(status.sources.length, 1); + }); + + it('rejects malformed recipe per-recipe, loads the rest', () => { + const { recipes, status } = loadRecipes({ + defaultPath: fix('default.json'), + userPath: fix('invalid.json'), + }); + assert.equal(recipes.length, 3); // 2 default + 1 valid from invalid.json + assert.equal(status.rejected, 1); + assert.equal(status.sources[1].rejected, 1); + }); +}); diff --git a/test/recipes-matcher.test.js b/test/recipes-matcher.test.js new file mode 100644 index 0000000..06b70f3 --- /dev/null +++ b/test/recipes-matcher.test.js @@ -0,0 +1,137 @@ +import { describe, it } from 'node:test'; +import assert from 'node:assert/strict'; +import { hostMatches, pathMatches, mergeRecipes, 
matchRecipesAgainst } from '../lib/recipes.js'; + +describe('hostMatches', () => { + it('matches exact hostname', () => { + assert.equal(hostMatches('example.com', 'example.com'), true); + assert.equal(hostMatches('example.com', 'other.com'), false); + }); + + it('is case-insensitive', () => { + assert.equal(hostMatches('Example.COM', 'example.com'), true); + }); + + it('star matches any character sequence including dots', () => { + assert.equal(hostMatches('*.example.com', 'foo.example.com'), true); + assert.equal(hostMatches('*.example.com', 'foo.bar.example.com'), true); + assert.equal(hostMatches('*.example.com', 'example.com'), false); // apex needs explicit entry + assert.equal(hostMatches('*.example.com', 'other.com'), false); + }); + + it('accepts an array — any-of semantics', () => { + assert.equal(hostMatches(['a.com', 'b.com'], 'b.com'), true); + assert.equal(hostMatches(['a.com', 'b.com'], 'c.com'), false); + }); + + it('escapes regex special chars in literal parts', () => { + assert.equal(hostMatches('foo.example.com', 'foo.example.com'), true); + assert.equal(hostMatches('foo.example.com', 'fooXexample.com'), false); // dot is literal + }); +}); + +describe('pathMatches', () => { + it('matches exact path', () => { + assert.equal(pathMatches('/foo', '/foo'), true); + assert.equal(pathMatches('/foo', '/bar'), false); + }); + + it('** matches multiple segments', () => { + assert.equal(pathMatches('/**', '/'), true); + assert.equal(pathMatches('/**', '/a/b/c'), true); + assert.equal(pathMatches('/foo/**', '/foo/a/b'), true); + assert.equal(pathMatches('/foo/**', '/bar/a/b'), false); + }); + + it('* matches single segment (no slashes)', () => { + assert.equal(pathMatches('/foo/*', '/foo/bar'), true); + assert.equal(pathMatches('/foo/*', '/foo/bar/baz'), false); + assert.equal(pathMatches('/foo/*', '/foo/'), false); + }); + + it('mixed * and ** in the same pattern', () => { + assert.equal(pathMatches('/*/issues/*', '/owner/issues/123'), true); + 
assert.equal(pathMatches('/*/issues/*', '/owner/sub/issues/123'), false); // * = single segment + assert.equal(pathMatches('/*/issues/**', '/owner/issues/123/comment/456'), true); + }); +}); + +describe('mergeRecipes', () => { + it('returns empty merge for no recipes', () => { + const m = mergeRecipes([]); + assert.deepEqual(m.preprocess, []); + assert.deepEqual(m.removeSelectors, []); + assert.equal(m.extractor, undefined); + assert.deepEqual(m.fetch, {}); + }); + + it('concatenates preprocess action lists in order', () => { + const r1 = { preprocess: [{ action: 'remove-attr', selector: 'p', attr: 'aria-hidden' }], select: { remove: [] }, fetch: {} }; + const r2 = { preprocess: [{ action: 'remove-class', selector: 'p', class: 'paywall' }], select: { remove: [] }, fetch: {} }; + const m = mergeRecipes([r1, r2]); + assert.equal(m.preprocess.length, 2); + assert.equal(m.preprocess[0].action, 'remove-attr'); + assert.equal(m.preprocess[1].action, 'remove-class'); + }); + + it('concatenates select.remove lists', () => { + const r1 = { preprocess: [], select: { remove: ['aside'] }, fetch: {} }; + const r2 = { preprocess: [], select: { remove: ['.ads'] }, fetch: {} }; + const m = mergeRecipes([r1, r2]); + assert.deepEqual(m.removeSelectors, ['aside', '.ads']); + }); + + it('extractor is last-wins', () => { + const r1 = { preprocess: [], select: { remove: [] }, fetch: {}, extractor: 'readability' }; + const r2 = { preprocess: [], select: { remove: [] }, fetch: {}, extractor: 'trafilatura' }; + assert.equal(mergeRecipes([r1, r2]).extractor, 'trafilatura'); + }); + + it('fetch fields merge per-key, not as whole object', () => { + const r1 = { preprocess: [], select: { remove: [] }, fetch: { wait_for: '.x' } }; + const r2 = { preprocess: [], select: { remove: [] }, fetch: { mobile_ua: true } }; + const m = mergeRecipes([r1, r2]); + assert.equal(m.fetch.wait_for, '.x'); // from r1, preserved + assert.equal(m.fetch.mobile_ua, true); // from r2 + }); + + it('fetch field 
last-wins on per-key conflict', () => { + const r1 = { preprocess: [], select: { remove: [] }, fetch: { render: 'force' } }; + const r2 = { preprocess: [], select: { remove: [] }, fetch: { render: 'skip' } }; + assert.equal(mergeRecipes([r1, r2]).fetch.render, 'skip'); + }); +}); + +describe('matchRecipesAgainst', () => { + const recipes = [ + { name: 'a', host: '*.example.com', path: '/**', preprocess: [], select: { remove: [] }, fetch: {} }, + { name: 'b', host: 'github.com', path: '/*/issues/*', preprocess: [], select: { remove: [] }, fetch: { render: 'force' } }, + { name: 'c', host: 'github.com', path: '/**', preprocess: [], select: { remove: [] }, fetch: {} }, + ]; + + it('returns recipes whose host AND path match', () => { + const merged = matchRecipesAgainst(recipes, new URL('https://github.com/owner/issues/123')); + assert.equal(merged.fetch.render, 'force'); // 'b' matched (and 'c'); both apply + }); + + it('skips recipes where path does not match', () => { + const merged = matchRecipesAgainst(recipes, new URL('https://github.com/owner/pulls/1')); + // 'b' does NOT match (path /*/issues/*); 'c' matches; render stays unset + assert.equal(merged.fetch.render, undefined); + }); + + it('returns empty merge when nothing matches', () => { + const merged = matchRecipesAgainst(recipes, new URL('https://other.org/')); + assert.deepEqual(merged.preprocess, []); + assert.equal(merged.extractor, undefined); + }); + + it('matches real GitHub issue URLs (org/repo/issues/N)', () => { + const ghRecipes = [ + { name: 'gh', host: 'github.com', path: '/*/*/issues/*', + preprocess: [], select: { remove: [] }, fetch: { render: 'force' } }, + ]; + const merged = matchRecipesAgainst(ghRecipes, new URL('https://github.com/AeternaLabsHQ/pullmd/issues/10')); + assert.equal(merged.fetch.render, 'force', 'three-segment github path must match /*/*/issues/*'); + }); +}); diff --git a/test/recipes-status-endpoint.test.js b/test/recipes-status-endpoint.test.js new file mode 100644 index 
0000000..b4a4809 --- /dev/null +++ b/test/recipes-status-endpoint.test.js @@ -0,0 +1,58 @@ +import { describe, it } from 'node:test'; +import assert from 'node:assert/strict'; +import { createApp } from '../server.js'; +import { loadRecipes } from '../lib/recipes.js'; +import { fileURLToPath } from 'node:url'; +import path from 'node:path'; + +const here = path.dirname(fileURLToPath(import.meta.url)); +const fix = (rel) => path.join(here, 'fixtures/recipes', rel); + +describe('GET /api/recipes/status', () => { + it('returns ok=true with counts when all recipes loaded', async () => { + loadRecipes({ defaultPath: fix('default.json') }); + const app = createApp({ cache: null }); + const server = app.listen(0); + const port = server.address().port; + try { + const res = await fetch(`http://localhost:${port}/api/recipes/status`); + assert.equal(res.status, 200); + const body = await res.json(); + assert.equal(body.ok, true); + assert.equal(body.loaded, 2); + assert.equal(body.rejected, 0); + assert.equal(body.sources.length, 1); + } finally { + server.close(); + } + }); + + it('returns ok=false when there are rejections', async () => { + loadRecipes({ defaultPath: fix('default.json'), userPath: fix('invalid.json') }); + const app = createApp({ cache: null }); + const server = app.listen(0); + const port = server.address().port; + try { + const res = await fetch(`http://localhost:${port}/api/recipes/status`); + assert.equal(res.status, 200); + const body = await res.json(); + assert.equal(body.ok, false); + assert.equal(body.rejected, 1); + } finally { + server.close(); + } + }); + + it('does not require auth (returns 200 without bearer/session)', async () => { + loadRecipes({ defaultPath: fix('default.json') }); + const app = createApp({ cache: null }); + const server = app.listen(0); + const port = server.address().port; + try { + const res = await fetch(`http://localhost:${port}/api/recipes/status`); + assert.equal(res.status, 200); + } finally { + server.close(); + } 
+  });
+});
diff --git a/test/web.test.js b/test/web.test.js
index 39ca5d2..4d84f20 100644
--- a/test/web.test.js
+++ b/test/web.test.js
@@ -1,6 +1,7 @@
 import { describe, it, beforeEach, afterEach } from 'node:test';
 import assert from 'node:assert/strict';
 import { extractWeb } from '../lib/web.js';
+import { matchRecipesAgainst } from '../lib/recipes.js';
 
 // Single-fetch: extractWeb makes exactly ONE request per call.
 // The Accept header includes text/markdown preference.
@@ -560,3 +561,139 @@ describe('cleanDom CMS-pattern preprocessing', () => {
     assert.match(result.markdown, /A red sunset over mountains/);
   });
 });
+
+describe('extractWeb — recipe integration (Hook 0+1)', () => {
+  // HTML substantial enough that renderDecision returns no on its own
+  // (sufficient length, multiple paragraphs, no fallback). Ensures the recipe
+  // render=force flag is the SOLE reason renderClient gets invoked.
+  const substantialHtml = `<html><head><title>Substantial Article</title></head><body><article><h1>Substantial Article</h1>${'<p>This is a long paragraph with meaningful content and enough words to clear the eighty-character substantial threshold easily, so multiple of these will produce strong static extraction.</p>'.repeat(20)}</article></body></html>`;
+
+  it('uses recipe.fetch.render when no query render param', async () => {
+    const recipes = [{ name: 'r', host: 'example.com', path: '/**', preprocess: [], select: { remove: [] }, fetch: { render: 'force' } }];
+    let renderCalled = false;
+    const fetcher = mockFetch({
+      ok: true,
+      headers: { get: (h) => h === 'content-type' ? 'text/html' : null },
+      text: async () => substantialHtml,
+      arrayBuffer: async () => new TextEncoder().encode(substantialHtml).buffer,
+      status: 200,
+    });
+    const renderClient = async (url, opts) => {
+      renderCalled = true;
+      return substantialHtml;
+    };
+    await extractWeb('https://example.com/', { fetch: fetcher, renderClient, recipes });
+    assert.equal(renderCalled, true, 'recipe render=force should trigger renderClient');
+  });
+
+  it('query render=skip wins over recipe render=force', async () => {
+    const recipes = [{ name: 'r', host: 'example.com', path: '/**', preprocess: [], select: { remove: [] }, fetch: { render: 'force' } }];
+    let renderCalled = false;
+    const fetcher = mockFetch({
+      ok: true,
+      headers: { get: (h) => h === 'content-type' ? 'text/html' : null },
+      text: async () => '<html><body><p>x</p></body></html>',
+      arrayBuffer: async () => new TextEncoder().encode('<html><body><p>x</p></body></html>').buffer,
+      status: 200,
+    });
+    const renderClient = async () => { renderCalled = true; return ''; };
+    await extractWeb('https://example.com/', { fetch: fetcher, renderClient, recipes, render: 'skip' });
+    assert.equal(renderCalled, false);
+  });
+});
+
+describe('extractWeb — recipe integration (Hook 2 preprocess + select)', () => {
+  it('applies recipe preprocess actions before extraction', async () => {
+    const recipes = [{
+      name: 'r', host: 'example.com', path: '/**',
+      preprocess: [{ action: 'remove-element', selector: 'div.ads-noise' }],
+      select: { remove: [] }, fetch: {},
+    }];
+    const html = '<html><head><title>T</title></head><body><article>' +
+      '<div class="ads-noise">PREPROCESS-SHOULD-REMOVE-ME</div>' +
+      '<p>A substantial paragraph with enough body text to clear extraction-quality thresholds, ' +
+      'this is filler content for the test, more filler content for the test, and even more.</p>' +
+      '</article></body></html>';
+    const fetcher = mockFetch({
+      ok: true,
+      headers: { get: (h) => h === 'content-type' ? 'text/html' : null },
+      text: async () => html,
+      arrayBuffer: async () => new TextEncoder().encode(html).buffer,
+      status: 200,
+    });
+    const result = await extractWeb('https://example.com/', { fetch: fetcher, recipes });
+    assert.ok(result.markdown.includes('substantial paragraph'), 'body paragraph survives');
+    assert.equal(result.markdown.includes('PREPROCESS-SHOULD-REMOVE-ME'), false,
+      'recipe preprocess remove-element must strip the noise div');
+  });
+
+  it('extends cleanDom REMOVE_SELECTORS via recipe select.remove', async () => {
+    const recipes = [{
+      name: 'r', host: 'example.com', path: '/**',
+      preprocess: [], select: { remove: ['div.recipe-only-strip'] }, fetch: {},
+    }];
+    const html = '<html><head><title>T</title></head><body><article>' +
+      '<div class="recipe-only-strip">SELECT-SHOULD-NOT-APPEAR</div>' +
+      '<p>A substantial paragraph with enough body text to clear extraction-quality thresholds for the article container.</p>' +
+      '</article></body></html>';
+    const fetcher = mockFetch({
+      ok: true,
+      headers: { get: (h) => h === 'content-type' ? 'text/html' : null },
+      text: async () => html,
+      arrayBuffer: async () => new TextEncoder().encode(html).buffer,
+      status: 200,
+    });
+    const result = await extractWeb('https://example.com/', { fetch: fetcher, recipes });
+    assert.equal(result.markdown.includes('SELECT-SHOULD-NOT-APPEAR'), false,
+      'recipe select.remove must strip the targeted div');
+  });
+});
+
+describe('extractWeb — Hook 3 (playwright fetch options)', () => {
+  it('passes recipe.fetch.wait_for and mobile_ua to renderClient', async () => {
+    const recipes = [{
+      name: 'r', host: 'example.com', path: '/**',
+      preprocess: [], select: { remove: [] },
+      fetch: { render: 'force', wait_for: '.gate', wait_timeout_ms: 3000, mobile_ua: true },
+    }];
+    let renderOpts;
+    const fetcher = mockFetch({
+      ok: true,
+      headers: { get: (h) => h === 'content-type' ? 'text/html' : null },
+      text: async () => '<html><body><p>x</p></body></html>',
+      arrayBuffer: async () => new TextEncoder().encode('<html><body><p>x</p></body></html>').buffer,
+      status: 200,
+    });
+    const renderClient = async (url, opts) => {
+      renderOpts = opts;
+      return '<html><body><article><h1>R</h1><p>rendered substantial body content paragraph for testing pipeline.</p></article></body></html>';
+    };
+    await extractWeb('https://example.com/', { fetch: fetcher, renderClient, recipes });
+    assert.equal(renderOpts.waitFor, '.gate');
+    assert.equal(renderOpts.waitTimeoutMs, 3000);
+    assert.equal(renderOpts.mobileUa, true);
+  });
+
+  it('passes a User-Agent string to renderClient (from the rotation pool)', async () => {
+    const recipes = [{
+      name: 'r', host: 'example.com', path: '/**',
+      preprocess: [], select: { remove: [] },
+      fetch: { render: 'force' }, // force render to exercise the renderClient path
+    }];
+    let renderOpts;
+    const fetcher = mockFetch({
+      ok: true,
+      headers: { get: (h) => h === 'content-type' ? 'text/html' : null },
+      text: async () => '<html><body><p>x</p></body></html>',
+      arrayBuffer: async () => new TextEncoder().encode('<html><body><p>x</p></body></html>').buffer,
+      status: 200,
+    });
+    const renderClient = async (url, opts) => {
+      renderOpts = opts;
+      return '<html><body><article><h1>R</h1><p>rendered substantial body content paragraph for testing pipeline.</p></article></body></html>';
+    };
+    await extractWeb('https://example.com/', { fetch: fetcher, renderClient, recipes });
+    assert.ok(typeof renderOpts.userAgent === 'string', 'userAgent should be a string');
+    assert.match(renderOpts.userAgent, /Mozilla\//, 'userAgent should look like a real UA string');
+  });
+});