chore: sync upstream firecrawl/main (2026-04-27) #5

Merged
agenticassets merged 638 commits into main from chore/sync-upstream-2026-04-20 on May 5, 2026

Conversation

@cayman-openclaw (Collaborator) commented Apr 20, 2026

Weekly upstream sync refresh.

Summary

  • Merged upstream/main at 9015e7df1 into the existing integration branch chore/sync-upstream-2026-04-20
  • The merge was clean; no manual conflict resolution was required
  • Fork drift before this refresh: 21 fork-only commits vs 597 upstream-only commits on origin/main...upstream/main
  • Updated existing integration PR chore: sync upstream firecrawl/main (2026-04-27) #5 instead of opening a duplicate PR

Notable upstream changes included in this refresh

Validation

  • apps/js-sdk/firecrawl: pnpm test:unit ✅ (48 tests passed)
  • apps/api: validation blocked locally by missing Rust toolchain / cargo; pnpm install fails while building native package @mendable/firecrawl-rs, and pnpm build then cannot resolve that package

Notes

nickscamara and others added 30 commits March 23, 2026 17:44
* Nick:

* Nick:

* Update scrape-browser.ts
…l#3223)

* Update scrape-browser.ts

* Update apps/api/src/controllers/v2/scrape-browser.ts

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

---------

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
…l#3224)

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: micahstairs <micah@sideguide.dev>
* feat: audio format

* rust formatting

* address cubic feedback

* address cubic feedback

* fix billing order
* feat(avgrab): get site support from service directly

* fix

* Update apps/api/src/scraper/scrapeURL/transformers/audio.ts

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

---------

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
…l#3228)

rotate-api-key and agent-signup-block controllers delete API keys from
the database but do not invalidate the ACUC Redis cache (600s TTL),
allowing revoked keys to authenticate for up to 10 minutes. Add
clearACUC() calls after key deletion in both controllers, matching the
pattern already used in the dashboard revoke action and the
agentSignupConfirmController.

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: micahstairs <micah@sideguide.dev>
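The fix pattern described above can be sketched as follows. This is a minimal stand-in, not the controllers' actual code: the in-memory `Map`/`Set` substitute for Redis and the database, and the helper names (`revokeApiKey`) are assumptions; only `clearACUC` is named in the commit message.

```typescript
// Sketch: after deleting an API key, also clear its cached auth entry so
// the 600s ACUC cache TTL cannot keep a revoked key authenticating.
type Cache = Map<string, unknown>;

const acucCache: Cache = new Map(); // stand-in for the Redis ACUC cache
const apiKeys = new Set<string>(["fc-live-123"]); // stand-in for the DB
acucCache.set("fc-live-123", { teamId: "t1" });

function clearACUC(cache: Cache, apiKey: string): void {
  cache.delete(apiKey);
}

function revokeApiKey(apiKey: string): void {
  apiKeys.delete(apiKey); // delete the key from the database
  clearACUC(acucCache, apiKey); // and invalidate its cached auth entry
}

revokeApiKey("fc-live-123");
```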
Upgrade autumn-js from 1.0.0-beta.10 to 1.1.6 (latest stable).

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: micahstairs <micah@sideguide.dev>
* fix(deps): override vulnerable transitive dependencies

Add pnpm overrides for picomatch (<4.0.2), yaml (<2.7.1), and
smol-toml (<1.3.2) across all affected packages to resolve CI
security audit failures.

* fix(deps): correct override versions for picomatch, yaml, smol-toml

picomatch >=4.0.4, yaml >=2.8.3, smol-toml >=1.6.1
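The override shape, as a hedged sketch of the root `package.json` fragment (versions taken from the commit message; the exact placement in the workspace is an assumption):

```json
{
  "pnpm": {
    "overrides": {
      "picomatch": ">=4.0.4",
      "yaml": ">=2.8.3",
      "smol-toml": ">=1.6.1"
    }
  }
}
```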
* chore(deps): upgrade pdf-inspector to 44092bc

* chore(deps): upgrade pdf-inspector to 4d52d7a
Update Feature Overview table and dedicated section to reflect the
rename from Browse to Interact, with updated code examples showing
the scrape-then-interact workflow, prompt-based interaction, code
execution, and persistent profiles.
docs(readme): replace Browse with Interact section
…recrawl#3239)

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: micahstairs <micah@sideguide.dev>
Co-authored-by: Micah Stairs <micah.stairs@gmail.com>
devin-ai-integration Bot and others added 29 commits April 30, 2026 07:09
… lifecycle methods

- Empty/blank API keys explicitly passed via builder.apiKey() now throw
  immediately instead of silently falling back to environment variables
- Add missing async methods: startCrawlAsync, getCrawlStatusAsync,
  cancelCrawlAsync, startBatchScrapeAsync, getBatchScrapeStatusAsync,
  cancelBatchScrapeAsync, startAgentAsync, getAgentStatusAsync,
  cancelAgentAsync, getConcurrencyAsync, getCreditUsageAsync

Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
…nc crawl

- Add limit > 0 validation to sync and async map() methods
- Add limit > 0 validation to async crawl _prepare_crawl_request()
  (sync version already had this validation)
- Negative or zero limits now raise ValueError consistently across
  all methods: search, crawl, and map

Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
…anging

Timeout paths in both WebSocket and pollLoop were calling close()
without emitting an error event, which would cause the start() Promise
to hang indefinitely. Now emits an error event before closing so the
Promise properly resolves.

Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
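The hang-versus-settle behavior above can be illustrated with a minimal sketch. This is not the SDK's real Watcher API; the class shape and method names here are assumptions that only demonstrate the ordering fix (emit the error first, then close):

```typescript
// Sketch: a timeout path must emit an error before close() so the
// Promise returned by start() settles instead of hanging forever.
class Watcher {
  private terminal: ((err?: Error) => void)[] = [];
  closed = false;

  start(): Promise<void> {
    return new Promise((resolve, reject) => {
      this.terminal.push(err => (err ? reject(err) : resolve()));
    });
  }

  close(): void {
    this.closed = true;
  }

  // Timeout path after the fix: emit first, then close.
  handleTimeout(): void {
    const err = new Error("watcher timed out");
    for (const cb of this.terminal) cb(err); // settles start()'s Promise
    this.close();
  }
}
```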
StartCrawl, Map, and Search now reject non-positive limit values
with an error instead of passing them through to the API.

Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
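The guard these SDK fixes add (in Go, Python, and Java respectively) follows one pattern, sketched here in TypeScript; the function name is illustrative, not any SDK's actual API:

```typescript
// Sketch: reject a non-positive limit client-side instead of passing
// it through to the API.
function validateLimit(limit?: number): void {
  if (limit !== undefined && limit <= 0) {
    throw new RangeError(`limit must be positive, got ${limit}`);
  }
}
```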
Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
…hange)

Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
…-go-sdk-qa-issues

fix(go-sdk): add client-side validation for negative limit values
…-python-sdk-qa-issues

fix(python-sdk): add missing negative limit validation in map and async crawl
…-java-sdk-qa-issues

fix(java-sdk): reject explicitly empty API keys and add missing async lifecycle methods
…-nodejs-sdk-watcher-issues

fix(js-sdk): fix watcher duplicate events, start() Promise resolution, and done event fields
…-ruby-sdk-qa-issues

fix(ruby-sdk): unwrap credit_usage data field and default skipTlsVerification to false
…_timeout (firecrawl#3467)

* fix(nuq-postgres): spread reindexes per-index, bound maintenance with statement_timeout

The single daily REINDEX TABLE CONCURRENTLY on nuq.queue_scrape (and
queue_crawl_finished) at 09:00 UTC could run for 8+ hours on the live
queue, holding ShareUpdateExclusiveLock and blocking autovacuum the
whole time. With autovacuum stalled, queue_scrape accumulated 23% dead
tuples, the visibility map went stale (index-only scans were doing 5x
heap fetches), and worker pool starvation surfaced as multi-second
nuqHealthCheck (SELECT 1), nuqRenewLock and nuqGetJobToProcess
latencies.

Changes:
- Replace the two whole-table REINDEX crons with per-index
  REINDEX INDEX CONCURRENTLY jobs spread across 02:00-08:20 UTC, 20 min
  apart. Per-index reindex takes its lock only on the target index, and
  the staggered cadence prevents any one job from stacking against the
  next.
- Each reindex cron sets statement_timeout = 20min, so a stuck job
  self-aborts instead of silently sitting for hours.
- Add statement_timeout to the two cleanup crons we observed hanging
  in production: nuq_queue_scrape_backlog_reaper (50s) and
  nuq_group_crawl_clean (4min).
- Add nuq_queue_scrape_backlog_times_out_at_idx so the per-minute
  backlog reaper does an index range scan instead of a seq scan over
  the whole backlog (~3M rows).

Operator note: nuq.sql only runs at initdb. Apply this change to the
running primary manually:

  CREATE INDEX CONCURRENTLY IF NOT EXISTS
    nuq_queue_scrape_backlog_times_out_at_idx
    ON nuq.queue_scrape_backlog (times_out_at);

  SELECT cron.unschedule('nuq_queue_scrape_reindex');
  SELECT cron.unschedule('nuq_queue_crawl_finished_reindex');
  -- then run the new SELECT cron.schedule(...) lines from this file.

Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): prune cron.job_run_details hourly

pg_cron does not auto-trim cron.job_run_details and the table has no
default index on start_time, so with sub-minute jobs it grows unbounded
and any time-scoped query against it ends up seq-scanning the whole
history. Keep the last 24h.

Co-Authored-By: mogery <mogery@sideguide.dev>
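A hedged sketch of what the hourly retention statement could look like (the real cron body lives in nuq.sql; only the 24h window is stated in the commit message):

```sql
-- Keep only the last 24h of pg_cron run history.
DELETE FROM cron.job_run_details
WHERE start_time < now() - interval '24 hours';
```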

* fix(nuq-postgres): make reindex crons single-statement, add watchdog

REINDEX CONCURRENTLY cannot run inside a transaction block, and pg_cron
wraps multi-statement command bodies in an implicit transaction. So the
inline `SET statement_timeout = '20min'; REINDEX INDEX CONCURRENTLY ...`
form fails with `REINDEX CONCURRENTLY cannot run inside a transaction
block`.

Drop the inline SET and add a 5-minute watchdog cron that cancels any
nuq reindex running longer than 25 minutes. The watchdog provides the
same self-aborting safety the inline timeout was meant to give, without
the tx-block restriction.

Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): tighten reindex watchdog to actually cap < slot cadence

The previous watchdog (every 5 min, threshold 25 min) could cancel a
stuck reindex as late as ~30 min in, overlapping the next 20-min slot.
Run the watchdog every minute with an 18-min threshold so the worst-case
runtime is ~19 min, strictly under the 20-min cadence.

Co-Authored-By: mogery <mogery@sideguide.dev>
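A hypothetical watchdog query consistent with the description above (the actual cron body is in nuq.sql; the query-text match is an assumption): run every minute, cancel any nuq reindex that has exceeded the 18-minute threshold.

```sql
-- Cancel long-running nuq reindexes before they overlap the next slot.
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE query ILIKE 'reindex index concurrently nuq.%'
  AND state = 'active'
  AND now() - query_start > interval '18 minutes';
```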

---------

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: mogery <mogery@sideguide.dev>
…exes (firecrawl#3468)

* fix(nuq-postgres): batch group cleanup and add predicate-matching indexes

Cap nuq_group_crawl_clean at 10000 groups per run with FOR UPDATE SKIP
LOCKED so the cascading deletes can't outrun the 5min schedule. Add
partial indexes that match the standalone (group_id IS NULL) cleaners'
predicates so they stop seq-scanning the 18M-row queue_scrape table on
every tick.

Co-Authored-By: mogery <mogery@sideguide.dev>
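The batched selection could look like the sketch below. Table and column names are inferred from the commit messages, not copied from nuq.sql, and the victim predicate is taken from the later commit in this same series:

```sql
-- Cap each cleaner run and skip groups another worker already holds,
-- so cascading deletes cannot outrun the schedule.
WITH victims AS (
  SELECT group_id
  FROM nuq.group_crawl
  WHERE status = 'completed' AND expires_at < now()
  LIMIT 10000
  FOR UPDATE SKIP LOCKED
)
DELETE FROM nuq.group_crawl gc
USING victims v
WHERE gc.group_id = v.group_id;
```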

* fix(nuq-postgres): schedule REINDEX for standalone partial indexes

Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): add plain group_id indexes and victim-selection index

EXPLAIN on the cascading DELETE in nuq_group_crawl_clean showed a seq scan
over 21.8M rows of queue_scrape -- ~7s for just 100 group_ids -- because
every existing (group_id, ...) index was filtered by mode='single_urls' or
status, so none covered DELETE WHERE group_id IN (...). queue_crawl_finished
had no group_id index at all. group_crawl had no index for the
status='completed' AND expires_at < now() victim selection (the existing
partial is on status='active', the opposite predicate).

Add three plain indexes plus matching REINDEX schedules.

Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): drop nuq_group_crawl_clean LIMIT to 2000

10000 was too aggressive given the variance in group sizes (p50=9 jobs/group
but max=8495). One outlier group landing in a batch was enough to blow past
the 4min statement_timeout even with the new group_id index. Steady state
arrival is ~1600 groups/tick, so 2000 leaves margin while keeping worst-case
row counts safe.

Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): cleaner runs every minute with LIMIT 500 and 90s timeout

Per-row delete cost on queue_scrape (12 indexes, scattered heap pages, ~86
rows/s measured) is the ceiling, not group selection. Bigger batches just
let one heavy outlier group blow the timeout. Smaller batches at faster
cadence get steady-state runs into ~10s and bound worst-case under the
tighter 90s timeout. If a tick still fails, the next minute retries instead
of holding a 4min transaction.

Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): drop unused completed-created-at index on queue_scrape

Mirror prod: nuq_queue_scrape_completed_created_at_idx had ~9k scans vs
650M+ on its peers (3 orders of magnitude less) and was 367 MB. Replaced
for the standalone cleaner by the new _standalone_ partial index. Dropped
on prod after confirming no caller; remove from schema and unschedule its
REINDEX cron.

Co-Authored-By: mogery <mogery@sideguide.dev>

---------

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: mogery <mogery@sideguide.dev>
Route audio scrapes through Chrome CDP so YouTube cookies can be collected and forwarded to avgrab for authenticated yt-dlp downloads.

Made-with: Cursor
* feat(search): includeDomains/excludeDomains

* bump sdks

---------

Co-authored-by: Nicolas <20311743+nickscamara@users.noreply.github.com>
…irecrawl#3473)

When the only requested format is audio, send skipYouTubeTranscript:true
to fire-engine so the chrome-cdp worker doesn't inject the transcript
extraction script on YouTube watch URLs. Audio-only callers don't
consume the transcript markdown, and the script's button click triggers
a YouTube SPA main-frame swap that's currently producing widespread CDP
errors and 30s+ tail latency on YouTube scrapes.

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: mogery <mogery@sideguide.dev>
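The audio-only decision can be sketched as below. The `skipYouTubeTranscript` flag name comes from the commit message; the request shape and function name are assumptions:

```typescript
// Sketch: only set skipYouTubeTranscript when audio is the sole
// requested format, so transcript extraction still runs otherwise.
interface FireEngineRequest {
  url: string;
  skipYouTubeTranscript?: boolean;
}

function buildRequest(url: string, formats: string[]): FireEngineRequest {
  const audioOnly = formats.length === 1 && formats[0] === "audio";
  return audioOnly ? { url, skipYouTubeTranscript: true } : { url };
}
```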
…irecrawl#3474)

zerolog was configured with TimeFormatUnix, which writes integer Unix
seconds. Cloud Logging then ingests every entry with second-level
resolution, so high-throughput log lines from this service collapse into
a single timestamp bucket and sort unpredictably against other services'
logs (which use RFC3339 with sub-second precision). Switching to
RFC3339Nano preserves the original event ordering end-to-end.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@agenticassets agenticassets merged commit a48d105 into main May 5, 2026
0 of 10 checks passed