Merged
26 commits
faff168
feat(recipes): add Recipe + Action Zod schemas
syswave-dev May 6, 2026
8f814de
chore(recipes): drop unused ActionEnumSchema re-export
syswave-dev May 6, 2026
9637a68
feat(recipes): loadRecipes for default file
syswave-dev May 6, 2026
4295f92
test(recipes): user overlay loading and recipe-level rejection
syswave-dev May 6, 2026
5b512d9
feat(recipes): hostMatches glob with array-any semantics
syswave-dev May 6, 2026
f3437b2
feat(recipes): pathMatches glob with single/multi-segment wildcards
syswave-dev May 6, 2026
431f9f5
feat(recipes): matchRecipes filter + merge (concat lists, last-wins s…
syswave-dev May 6, 2026
19c5f6f
feat(recipes): applyPreprocessActions for all four actions
syswave-dev May 6, 2026
ae56bce
feat(cache): add meta table and recipes_invalidated_at setter
syswave-dev May 6, 2026
748baf5
feat(cache): get() honors recipes_invalidated_at
syswave-dev May 6, 2026
03aef1c
feat(recipes): hash recipes content and invalidate cache on change
syswave-dev May 6, 2026
98d6af5
feat(web): extractWeb honors recipe fetch options (render/extractor)
syswave-dev May 6, 2026
fec7161
test(web): make Hook 0+1 render-force test discriminating
syswave-dev May 6, 2026
8196f4a
feat(web): extractWeb applies recipe preprocess actions and select.re…
syswave-dev May 6, 2026
be2fadc
test(web): make Hook 2 preprocess + select tests discriminating
syswave-dev May 6, 2026
a3d7c3e
feat(playwright): forward wait_for/mobile_ua/wait_timeout_ms from recipe
syswave-dev May 6, 2026
e4a5ced
feat(server): GET /api/recipes/status (public, in-memory)
syswave-dev May 6, 2026
98e18a8
feat(server): load recipes at boot and stamp cache-invalidation
syswave-dev May 6, 2026
f67b99b
feat(recipes): seed default recipes for Future PLC and GitHub Issues
syswave-dev May 6, 2026
75dcda8
feat(playwright-sidecar): accept waitFor, waitTimeoutMs, mobileUa
syswave-dev May 6, 2026
ca93b7e
chore: bump to v2.2.0 + CHANGELOG + MIGRATION for recipe engine
syswave-dev May 6, 2026
e314387
docs(changelog): v2.2.0 entry for recipe engine
syswave-dev May 6, 2026
e1947a7
feat(playwright-sidecar): bundle playwright-stealth for headless dete…
syswave-dev May 6, 2026
5742e03
feat(playwright): forward rotated User-Agent from pool to sidecar
syswave-dev May 6, 2026
8754d09
feat(playwright-sidecar): accept userAgent override (defaults to hard…
syswave-dev May 6, 2026
04494e0
fix(recipes): github-issues path glob — match org/repo/issues/N (3 se…
syswave-dev May 6, 2026
37 changes: 37 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,42 @@
# Changelog

## v2.2.0 — 2026-05-XX

### Added

- **Site Recipe Engine** (#18). Declarative `site-recipes.json` for per-host preprocess, fetch, select, and extractor rules. Default recipes ship in the repo (`site-recipes.default.json`); self-hosters can mount `data/site-recipes.json` or set `PULLMD_SITE_RECIPES` to point elsewhere. Four recipe categories:
  - `preprocess` — DOM cleanup actions (`remove-attr`, `remove-class`, `remove-element`, `unwrap`) applied before extraction
  - `fetch` — render forcing (`render: force|skip`), wait-for selector, mobile UA
  - `select` — extra remove-selectors added to `cleanDom`
  - `extractor` — preferred extractor per host (`readability`, `trafilatura`, `playwright`)
- New endpoint `GET /api/recipes/status` (public, no auth) — counts loaded/rejected recipes per source for monitoring.
- Cache invalidation on recipe change. When recipe content changes between server boots, all existing cache rows are treated as stale and are re-extracted lazily on their next access.
- Playwright sidecar accepts new optional fields: `waitFor` (CSS selector), `waitTimeoutMs` (capped at 15000), `mobileUa` (boolean). Backwards compatible: the new fields are optional, and existing fields are passed through unchanged.
- Initial default recipes covering Future PLC sites (paywall + recommendation widgets) and GitHub Issues (JS-rendered comments).
- The Playwright sidecar bundles `playwright-stealth` to mitigate `navigator.webdriver`-style headless detection on JS-driven anti-bot pages.
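
The new sidecar fields could be validated along these lines. This is a minimal sketch, not the sidecar's actual code (which is not part of this diff): `normalizeRenderOptions` is a hypothetical helper, and the handling of invalid values is an assumption. Only the field names and the 15000 ms cap come from the changelog entry above.

```javascript
// Hypothetical helper, not taken from the sidecar source.
const MAX_WAIT_TIMEOUT_MS = 15000; // cap stated in the changelog entry

function normalizeRenderOptions(body = {}) {
  const opts = {};
  // waitFor: keep only non-empty CSS selector strings (assumed validation rule).
  if (typeof body.waitFor === 'string' && body.waitFor.trim() !== '') {
    opts.waitFor = body.waitFor.trim();
  }
  // waitTimeoutMs: accept positive finite numbers and clamp them to the cap.
  if (Number.isFinite(body.waitTimeoutMs) && body.waitTimeoutMs > 0) {
    opts.waitTimeoutMs = Math.min(body.waitTimeoutMs, MAX_WAIT_TIMEOUT_MS);
  }
  // mobileUa: only an explicit boolean true enables the mobile user agent.
  opts.mobileUa = body.mobileUa === true;
  return opts;
}
```

A request that omits all three fields yields `{ mobileUa: false }`, which matches the "optional fields" wording: older clients that only send `url` are unaffected.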

### Known limitations

- **Sites behind cookie-based consent walls** (third-party CMP frameworks like TCF v2) are not unlocked by recipes alone in this release. Such sites redirect non-consenting visitors to a JS-rendered consent UI and only return article content once HttpOnly cookies are set after a click. A future release will add a `fetch.cookies` recipe field so operators can paste their own consent state when they choose to. For now, write a custom recipe with whatever combination of `select.remove`, `extractor`, and `fetch` settings works for your specific source. The engine supports that experimentation, while the defaults stay conservative.

### Important — `:latest` tag stays on v1.x

The `:latest` tag in Docker Hub and GHCR remains pinned to v1.x until the scheduled flip on 2026-05-16. Self-hosters wanting the recipe engine **must pin `:v2.2.0`** (or `:2.2`) explicitly for both `pullmd` and `pullmd-playwright`. Pulling `:latest` continues to give you v1, **without** the recipe engine.

```yaml
services:
  pullmd:
    image: aeternalabshq/pullmd:2.2.0
  playwright:
    image: aeternalabshq/pullmd-playwright:2.2.0
```

### Migration

- New `meta` table created automatically on first boot. No action required.
- Existing cache rows remain valid until the first recipe content change is detected.
- See `MIGRATION.md` for the full upgrade path.

## v2.1.0 — 2026-05-05

### Added
65 changes: 65 additions & 0 deletions MIGRATION.md
@@ -69,3 +69,68 @@ If something goes wrong:
4. Restart.

The `users`/`sessions`/`api_keys`/`user_fetches` tables and the `user_id` column on `conversions` are unused by v1.x and can stay if you're not restoring; v1.x ignores them.

# Migrating from v2.1.x to v2.2.0

v2.2.0 ships the Site Recipe Engine (#18). This is a purely additive change: existing instances keep working unchanged. This section covers what to know if you want to use recipes.

## Pin v2 tags explicitly

`:latest` stays on v1.x until 2026-05-16. Update your compose / k8s manifests:

```yaml
# Before
image: aeternalabshq/pullmd:latest
# After
image: aeternalabshq/pullmd:2.2.0
# Also bump the playwright sidecar — wait_for and mobile_ua need the new sidecar:
image: aeternalabshq/pullmd-playwright:2.2.0
```

## Optional: mount user recipes

The default recipes in `site-recipes.default.json` cover Future PLC sites and GitHub Issues out of the box. To add your own:

```yaml
services:
  pullmd:
    image: aeternalabshq/pullmd:2.2.0
    volumes:
      - ./data:/app/data
    # Drop your custom recipes at ./data/site-recipes.json on the host
    # PullMD auto-discovers it. Or set PULLMD_SITE_RECIPES to a different path:
    environment:
      - PULLMD_SITE_RECIPES=/path/to/your/recipes.json
```

User recipes are concatenated with the defaults. On scalar conflicts (e.g. both define `extractor` for the same host), the user file wins because it is loaded after the defaults and the last value wins.
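
The merge rule can be illustrated with a small sketch. It assumes a hypothetical `mergeRecipes` helper and a simplified recipe shape (`extractor` as a scalar, `select.remove` as a list); the engine's real `matchRecipes` implementation is not shown in this excerpt and may differ.

```javascript
// Hypothetical merge helper illustrating "concatenate lists, last scalar wins".
// Recipes are passed in load order: defaults first, user file last.
function mergeRecipes(recipes) {
  const merged = {};
  for (const recipe of recipes) {
    mergeInto(merged, recipe);
  }
  return merged;
}

function mergeInto(target, source) {
  for (const [key, value] of Object.entries(source)) {
    if (Array.isArray(value)) {
      // Lists are concatenated in load order.
      const existing = Array.isArray(target[key]) ? target[key] : [];
      target[key] = [...existing, ...value];
    } else if (value !== null && typeof value === 'object') {
      // Nested sections (e.g. fetch, select) are merged key by key.
      const existing = target[key];
      const isPlainObject =
        existing !== null && typeof existing === 'object' && !Array.isArray(existing);
      target[key] = mergeInto(isPlainObject ? { ...existing } : {}, value);
    } else {
      // Scalars: the later recipe overwrites the earlier one.
      target[key] = value;
    }
  }
  return target;
}

const defaults = { extractor: 'readability', select: { remove: ['.ad'] } };
const user = { extractor: 'trafilatura', select: { remove: ['.newsletter'] } };
const merged = mergeRecipes([defaults, user]);
// merged.extractor is 'trafilatura'; merged.select.remove is ['.ad', '.newsletter']
```

Because new arrays and objects are built at every level, neither input recipe is mutated by the merge.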

## Schema migrations

The `meta` table is created automatically on first boot, so no manual SQL is needed. Existing cache rows remain valid until the first recipe content change is detected: the recipe file content is hashed with SHA-256 at boot, and on change `recipes_invalidated_at` is bumped so that older cache rows refresh lazily on next access.
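
The boot-time check described above might look roughly like this. It is a sketch under stated assumptions: `invalidateIfRecipesChanged` and the `recipes_hash` meta key are hypothetical names, the first boot is assumed to record the hash without invalidating anything, and the timestamp is written in SQLite's `YYYY-MM-DD HH:MM:SS` text format on the assumption that `created_at` uses the same format. Only `getMeta`, `setMeta`, and `setRecipesInvalidatedAt` come from the `lib/cache.js` diff.

```javascript
// CommonJS require so the sketch runs standalone; the repo itself uses ES modules.
const { createHash } = require('node:crypto');

// Hypothetical boot-time check. `cache` must expose getMeta, setMeta,
// and setRecipesInvalidatedAt (as added to lib/cache.js in this PR).
function invalidateIfRecipesChanged(cache, recipeFileContents, now = new Date()) {
  // Hash the content of every recipe file, in load order.
  const hash = createHash('sha256');
  for (const content of recipeFileContents) hash.update(content);
  const digest = hash.digest('hex');

  const previous = cache.getMeta('recipes_hash'); // assumed key name
  if (previous === digest) return false; // unchanged: cached rows stay valid

  cache.setMeta('recipes_hash', digest);
  if (previous === null) return false; // assumed: first boot only records the hash

  // Content changed: stamp the invalidation time as 'YYYY-MM-DD HH:MM:SS' (UTC).
  const stamp = now.toISOString().slice(0, 19).replace('T', ' ');
  cache.setRecipesInvalidatedAt(stamp);
  return true;
}
```

The function returns `true` only when a previously stored hash differs from the current one, which is the case in which older cache rows should stop being served.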

## Monitoring

`GET /api/recipes/status` returns `{ ok, loaded, rejected, sources }` — public, no auth. Add it to UptimeKuma / Healthchecks / equivalent to be alerted when a recipe fails to parse:

```json
{
  "ok": true,
  "loaded": 5,
  "rejected": 0,
  "sources": [
    { "path": "site-recipes.default.json", "loaded": 4, "rejected": 0 },
    { "path": "/app/data/site-recipes.json", "loaded": 1, "rejected": 0 }
  ]
}
```

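`ok = (rejected === 0)`. The endpoint always returns HTTP 200, so use the `ok` field for monitoring decisions. Rejection details are written to stderr at server start (`docker logs pullmd 2>&1 | grep recipes`; the `2>&1` is needed because `docker logs` replays the container's stderr on its own stderr).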
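
Since the status code is always 200, a probe has to inspect the body. The following sketch shows one way to turn the documented payload into a health verdict; `evaluateRecipeStatus` is a hypothetical helper for a custom monitoring script, not part of PullMD.

```javascript
// Hypothetical probe-side helper: interprets the /api/recipes/status payload.
function evaluateRecipeStatus(body) {
  if (!body || typeof body.ok !== 'boolean') {
    return { healthy: false, reason: 'malformed status payload' };
  }
  if (body.ok) {
    return { healthy: true, reason: `${body.loaded} recipe(s) loaded` };
  }
  // Name the source files that contain rejected recipes.
  const failing = (body.sources || [])
    .filter((source) => source.rejected > 0)
    .map((source) => source.path);
  return {
    healthy: false,
    reason: `${body.rejected} recipe(s) rejected in: ${failing.join(', ')}`,
  };
}

const sample = {
  ok: true,
  loaded: 5,
  rejected: 0,
  sources: [
    { path: 'site-recipes.default.json', loaded: 4, rejected: 0 },
    { path: '/app/data/site-recipes.json', loaded: 1, rejected: 0 },
  ],
};
const verdict = evaluateRecipeStatus(sample);
// verdict.healthy is true for the sample payload shown above
```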

## Rolling back to v2.1.x

The schema change is additive (new `meta` table, no column changes on existing tables). To roll back:

1. Stop the v2.2.0 container.
2. Pin to `aeternalabshq/pullmd:2.1.0`.
3. Restart. The `meta` table stays — v2.1.x ignores it.
34 changes: 32 additions & 2 deletions lib/cache.js
@@ -95,6 +95,14 @@ export function createCache(dbPath = '/data/cache.db') {
db.exec(`CREATE INDEX IF NOT EXISTS idx_user_fetches_fetched_at ON user_fetches(fetched_at)`);
db.exec(`CREATE UNIQUE INDEX IF NOT EXISTS idx_user_fetches_unique ON user_fetches(user_id, cache_id)`);

db.exec(`
CREATE TABLE IF NOT EXISTS meta (
key TEXT PRIMARY KEY,
value TEXT NOT NULL,
updated_at TEXT DEFAULT (datetime('now'))
)
`);

// Migrate: add share_id column if missing
const cols = db.prepare("PRAGMA table_info(conversions)").all().map(c => c.name);
if (!cols.includes('share_id')) {
@@ -109,6 +117,8 @@ export function createCache(dbPath = '/data/cache.db') {
db.exec('CREATE INDEX IF NOT EXISTS idx_conversions_user_id ON conversions(user_id)');
}

let recipesInvalidatedAt = '1970-01-01 00:00:00';

const stmts = {
upsert: db.prepare(`
INSERT INTO conversions (url, title, markdown, source, share_id, client, user_id, created_at)
@@ -124,7 +134,9 @@ export function createCache(dbPath = '/data/cache.db') {
`),
get: db.prepare(`
SELECT title, markdown, source, share_id, client, created_at FROM conversions
- WHERE url = ? AND created_at > datetime('now', '-1 hour')
+ WHERE url = ?
+ AND created_at > datetime('now', '-1 hour')
+ AND created_at > ?
`),
getByShareId: db.prepare(`
SELECT url, title, markdown, source, client, created_at FROM conversions
@@ -192,6 +204,11 @@ export function createCache(dbPath = '/data/cache.db') {
LIMIT ? OFFSET ?
`),
countForUser: db.prepare(`SELECT COUNT(*) as total FROM user_fetches WHERE user_id = ?`),
metaGet: db.prepare(`SELECT value FROM meta WHERE key = ?`),
metaSet: db.prepare(`
INSERT INTO meta (key, value, updated_at) VALUES (?, ?, datetime('now'))
ON CONFLICT(key) DO UPDATE SET value = excluded.value, updated_at = datetime('now')
`),
};

return {
@@ -214,7 +231,8 @@ export function createCache(dbPath = '/data/cache.db') {
},

get(url) {
- return stmts.get.get(url) || null;
+ const row = stmts.get.get(url, recipesInvalidatedAt);
+ return row || null;
},

getByShareId(shareId) {
@@ -305,5 +323,17 @@ export function createCache(dbPath = '/data/cache.db') {
}));
return { total, window, bySource, lowQualityDomains, fallbackByDomain };
},

getMeta(key) {
const row = stmts.metaGet.get(key);
return row ? row.value : null;
},
setMeta(key, value) {
stmts.metaSet.run(key, value);
},
setRecipesInvalidatedAt(iso) {
recipesInvalidatedAt = iso;
stmts.metaSet.run('recipes_invalidated_at', iso);
},
};
}
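
One detail worth checking when calling `setRecipesInvalidatedAt`: the `get` statement compares `created_at` with the stamp as text. SQLite's `datetime('now')` produces `YYYY-MM-DD HH:MM:SS`, while JavaScript's `toISOString()` produces `YYYY-MM-DDTHH:MM:SS.sssZ`, and the two formats do not sort consistently against each other. Which format the caller passes, and how `created_at` is written by `upsert`, is not visible in this diff, so the following is only a sketch with a hypothetical `toSqliteDatetime` helper.

```javascript
// Hypothetical helper, not part of the diff: convert a Date to the
// 'YYYY-MM-DD HH:MM:SS' (UTC) text format produced by SQLite's datetime('now').
function toSqliteDatetime(date) {
  return date.toISOString().slice(0, 19).replace('T', ' ');
}

// Mirrors the recipe-invalidation predicate of the `get` statement
// (created_at > ?) using plain string comparison, as SQLite does for TEXT.
function isNewerThanInvalidation(createdAt, recipesInvalidatedAt) {
  return createdAt > recipesInvalidatedAt;
}

const invalidatedAt = new Date('2026-05-06T10:00:00.000Z');
const rowCreatedAt = '2026-05-06 12:00:00'; // written two hours after invalidation

// Mixed formats: ' ' sorts before 'T', so the newer row looks stale.
const mixed = isNewerThanInvalidation(rowCreatedAt, invalidatedAt.toISOString()); // false
// Same format on both sides: the newer row is correctly treated as fresh.
const normalized = isNewerThanInvalidation(rowCreatedAt, toSqliteDatetime(invalidatedAt)); // true
```

If both sides already use the same format in the real code, no conversion is needed; the sketch only shows why the formats should match.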
10 changes: 8 additions & 2 deletions lib/playwright-client.js
@@ -9,7 +9,7 @@ const SIDECAR_TIMEOUT_MS = 25_000;
* @param {typeof fetch} [opts.fetch] Injectable for tests
* @returns {Promise<string>} rendered HTML
*/
- export async function renderViaSidecar(url, { signal, fetch: fetchFn = globalThis.fetch } = {}) {
+ export async function renderViaSidecar(url, { signal, fetch: fetchFn = globalThis.fetch, waitFor, waitTimeoutMs, mobileUa, userAgent } = {}) {
if (!process.env.PLAYWRIGHT_URL) throw new Error('Playwright sidecar not configured (PLAYWRIGHT_URL env)');

const ctrl = new AbortController();
@@ -20,11 +20,17 @@ export async function renderViaSidecar(url, { signal, fetch: fetchFn = globalThi
else signal.addEventListener('abort', onAbort, { once: true });
}

const body = { url };
if (waitFor !== undefined) body.waitFor = waitFor;
if (waitTimeoutMs !== undefined) body.waitTimeoutMs = waitTimeoutMs;
if (mobileUa !== undefined) body.mobileUa = mobileUa;
if (userAgent !== undefined) body.userAgent = userAgent;

try {
const res = await fetchFn(process.env.PLAYWRIGHT_URL, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
- body: JSON.stringify({ url }),
+ body: JSON.stringify(body),
signal: ctrl.signal,
});
if (!res.ok) throw new Error(`Sidecar returned ${res.status}`);