chore: sync upstream firecrawl/main (2026-04-27) #5

Merged
agenticassets merged 638 commits into main from chore/sync-upstream-2026-04-20 on May 5, 2026

Conversation

@cayman-openclaw (Collaborator) commented Apr 20, 2026

Weekly upstream sync refresh.

Summary

  • Merged upstream/main at 9015e7df1 into the existing integration branch chore/sync-upstream-2026-04-20
  • The merge was clean; no manual conflict resolution was required
  • Fork drift before this refresh: 21 fork-only commits vs 597 upstream-only commits on origin/main...upstream/main
  • Updated existing integration PR chore: sync upstream firecrawl/main (2026-04-27) #5 instead of opening a duplicate PR

Notable upstream changes included in this refresh

Validation

  • apps/js-sdk/firecrawl: pnpm test:unit ✅ (48 tests passed)
  • apps/api: validation blocked locally by missing Rust toolchain / cargo; pnpm install fails while building native package @mendable/firecrawl-rs, and pnpm build then cannot resolve that package

Notes

nickscamara and others added 30 commits March 23, 2026 17:44
* Nick:

* Nick:

* Update scrape-browser.ts
…l#3223)

* Update scrape-browser.ts

* Update apps/api/src/controllers/v2/scrape-browser.ts

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

---------

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
…l#3224)

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: micahstairs <micah@sideguide.dev>
* feat: audio format

* rust formatting

* address cubic feedback

* address cubic feedback

* fix billing order
* feat(avgrab): get site support from service directly

* fix

* Update apps/api/src/scraper/scrapeURL/transformers/audio.ts

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

---------

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
…l#3228)

rotate-api-key and agent-signup-block controllers delete API keys from
the database but do not invalidate the ACUC Redis cache (600s TTL),
allowing revoked keys to authenticate for up to 10 minutes. Add
clearACUC() calls after key deletion in both controllers, matching the
pattern already used in the dashboard revoke action and the
agentSignupConfirmController.

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: micahstairs <micah@sideguide.dev>
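The fix pattern described above can be sketched as follows. This is a minimal stand-in, not the controllers' actual code: the in-memory `Map`/`Set` substitute for Redis and the database, and the helper names (`revokeApiKey`) are assumptions; only `clearACUC` is named in the commit message.

```typescript
// Sketch: after deleting an API key, also clear its cached auth entry so
// the 600s ACUC cache TTL cannot keep a revoked key authenticating.
type Cache = Map<string, unknown>;

const acucCache: Cache = new Map(); // stand-in for the Redis ACUC cache
const apiKeys = new Set<string>(["fc-live-123"]); // stand-in for the DB
acucCache.set("fc-live-123", { teamId: "t1" });

function clearACUC(cache: Cache, apiKey: string): void {
  cache.delete(apiKey);
}

function revokeApiKey(apiKey: string): void {
  apiKeys.delete(apiKey); // delete the key from the database
  clearACUC(acucCache, apiKey); // and invalidate its cached auth entry
}

revokeApiKey("fc-live-123");
```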
Upgrade autumn-js from 1.0.0-beta.10 to 1.1.6 (latest stable).

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: micahstairs <micah@sideguide.dev>
* fix(deps): override vulnerable transitive dependencies

Add pnpm overrides for picomatch (<4.0.2), yaml (<2.7.1), and
smol-toml (<1.3.2) across all affected packages to resolve CI
security audit failures.

* fix(deps): correct override versions for picomatch, yaml, smol-toml

picomatch >=4.0.4, yaml >=2.8.3, smol-toml >=1.6.1
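The override shape, as a hedged sketch of the root `package.json` fragment (versions taken from the commit message; the exact placement in the workspace is an assumption):

```json
{
  "pnpm": {
    "overrides": {
      "picomatch": ">=4.0.4",
      "yaml": ">=2.8.3",
      "smol-toml": ">=1.6.1"
    }
  }
}
```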
* chore(deps): upgrade pdf-inspector to 44092bc

* chore(deps): upgrade pdf-inspector to 4d52d7a
Update Feature Overview table and dedicated section to reflect the
rename from Browse to Interact, with updated code examples showing
the scrape-then-interact workflow, prompt-based interaction, code
execution, and persistent profiles.
docs(readme): replace Browse with Interact section
…recrawl#3239)

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: micahstairs <micah@sideguide.dev>
Co-authored-by: Micah Stairs <micah.stairs@gmail.com>
devin-ai-integration Bot and others added 29 commits April 30, 2026 07:09
… lifecycle methods

- Empty/blank API keys explicitly passed via builder.apiKey() now throw
  immediately instead of silently falling back to environment variables
- Add missing async methods: startCrawlAsync, getCrawlStatusAsync,
  cancelCrawlAsync, startBatchScrapeAsync, getBatchScrapeStatusAsync,
  cancelBatchScrapeAsync, startAgentAsync, getAgentStatusAsync,
  cancelAgentAsync, getConcurrencyAsync, getCreditUsageAsync

Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
…nc crawl

- Add limit > 0 validation to sync and async map() methods
- Add limit > 0 validation to async crawl _prepare_crawl_request()
  (sync version already had this validation)
- Negative or zero limits now raise ValueError consistently across
  all methods: search, crawl, and map

Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
…anging

Timeout paths in both WebSocket and pollLoop were calling close()
without emitting an error event, which would cause the start() Promise
to hang indefinitely. Now emits an error event before closing so the
Promise properly resolves.

Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
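The hang-versus-settle behavior above can be illustrated with a minimal sketch. This is not the SDK's real Watcher API; the class shape and method names here are assumptions that only demonstrate the ordering fix (emit the error first, then close):

```typescript
// Sketch: a timeout path must emit an error before close() so the
// Promise returned by start() settles instead of hanging forever.
class Watcher {
  private terminal: ((err?: Error) => void)[] = [];
  closed = false;

  start(): Promise<void> {
    return new Promise((resolve, reject) => {
      this.terminal.push(err => (err ? reject(err) : resolve()));
    });
  }

  close(): void {
    this.closed = true;
  }

  // Timeout path after the fix: emit first, then close.
  handleTimeout(): void {
    const err = new Error("watcher timed out");
    for (const cb of this.terminal) cb(err); // settles start()'s Promise
    this.close();
  }
}
```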
StartCrawl, Map, and Search now reject non-positive limit values
with an error instead of passing them through to the API.

Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
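The guard these SDK fixes add (in Go, Python, and Java respectively) follows one pattern, sketched here in TypeScript; the function name is illustrative, not any SDK's actual API:

```typescript
// Sketch: reject a non-positive limit client-side instead of passing
// it through to the API.
function validateLimit(limit?: number): void {
  if (limit !== undefined && limit <= 0) {
    throw new RangeError(`limit must be positive, got ${limit}`);
  }
}
```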
Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
…hange)

Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
…-go-sdk-qa-issues

fix(go-sdk): add client-side validation for negative limit values
…-python-sdk-qa-issues

fix(python-sdk): add missing negative limit validation in map and async crawl
…-java-sdk-qa-issues

fix(java-sdk): reject explicitly empty API keys and add missing async lifecycle methods
…-nodejs-sdk-watcher-issues

fix(js-sdk): fix watcher duplicate events, start() Promise resolution, and done event fields
…-ruby-sdk-qa-issues

fix(ruby-sdk): unwrap credit_usage data field and default skipTlsVerification to false
…_timeout (firecrawl#3467)

* fix(nuq-postgres): spread reindexes per-index, bound maintenance with statement_timeout

The single daily REINDEX TABLE CONCURRENTLY on nuq.queue_scrape (and
queue_crawl_finished) at 09:00 UTC could run for 8+ hours on the live
queue, holding ShareUpdateExclusiveLock and blocking autovacuum the
whole time. With autovacuum stalled, queue_scrape accumulated 23% dead
tuples, the visibility map went stale (index-only scans were doing 5x
heap fetches), and worker pool starvation surfaced as multi-second
nuqHealthCheck (SELECT 1), nuqRenewLock and nuqGetJobToProcess
latencies.

Changes:
- Replace the two whole-table REINDEX crons with per-index
  REINDEX INDEX CONCURRENTLY jobs spread across 02:00-08:20 UTC, 20 min
  apart. Per-index reindex takes its lock only on the target index, and
  the staggered cadence prevents any one job from stacking against the
  next.
- Each reindex cron sets statement_timeout = 20min, so a stuck job
  self-aborts instead of silently sitting for hours.
- Add statement_timeout to the two cleanup crons we observed hanging
  in production: nuq_queue_scrape_backlog_reaper (50s) and
  nuq_group_crawl_clean (4min).
- Add nuq_queue_scrape_backlog_times_out_at_idx so the per-minute
  backlog reaper does an index range scan instead of a seq scan over
  the whole backlog (~3M rows).

Operator note: nuq.sql only runs at initdb. Apply this change to the
running primary manually:

  CREATE INDEX CONCURRENTLY IF NOT EXISTS
    nuq_queue_scrape_backlog_times_out_at_idx
    ON nuq.queue_scrape_backlog (times_out_at);

  SELECT cron.unschedule('nuq_queue_scrape_reindex');
  SELECT cron.unschedule('nuq_queue_crawl_finished_reindex');
  -- then run the new SELECT cron.schedule(...) lines from this file.

Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): prune cron.job_run_details hourly

pg_cron does not auto-trim cron.job_run_details and the table has no
default index on start_time, so with sub-minute jobs it grows unbounded
and any time-scoped query against it ends up seq-scanning the whole
history. Keep the last 24h.

Co-Authored-By: mogery <mogery@sideguide.dev>
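A hedged sketch of what the hourly retention statement could look like (the real cron body lives in nuq.sql; only the 24h window is stated in the commit message):

```sql
-- Keep only the last 24h of pg_cron run history.
DELETE FROM cron.job_run_details
WHERE start_time < now() - interval '24 hours';
```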

* fix(nuq-postgres): make reindex crons single-statement, add watchdog

REINDEX CONCURRENTLY cannot run inside a transaction block, and pg_cron
wraps multi-statement command bodies in an implicit transaction. So the
inline `SET statement_timeout = '20min'; REINDEX INDEX CONCURRENTLY ...`
form fails with `REINDEX CONCURRENTLY cannot run inside a transaction
block`.

Drop the inline SET and add a 5-minute watchdog cron that cancels any
nuq reindex running longer than 25 minutes. The watchdog provides the
same self-aborting safety the inline timeout was meant to give, without
the tx-block restriction.

Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): tighten reindex watchdog to actually cap < slot cadence

The previous watchdog (every 5 min, threshold 25 min) could cancel a
stuck reindex as late as ~30 min in, overlapping the next 20-min slot.
Run the watchdog every minute with an 18-min threshold so the worst-case
runtime is ~19 min, strictly under the 20-min cadence.

Co-Authored-By: mogery <mogery@sideguide.dev>
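A hypothetical watchdog query consistent with the description above (the actual cron body is in nuq.sql; the query-text match is an assumption): run every minute, cancel any nuq reindex that has exceeded the 18-minute threshold.

```sql
-- Cancel long-running nuq reindexes before they overlap the next slot.
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE query ILIKE 'reindex index concurrently nuq.%'
  AND state = 'active'
  AND now() - query_start > interval '18 minutes';
```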

---------

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: mogery <mogery@sideguide.dev>
…exes (firecrawl#3468)

* fix(nuq-postgres): batch group cleanup and add predicate-matching indexes

Cap nuq_group_crawl_clean at 10000 groups per run with FOR UPDATE SKIP
LOCKED so the cascading deletes can't outrun the 5min schedule. Add
partial indexes that match the standalone (group_id IS NULL) cleaners'
predicates so they stop seq-scanning the 18M-row queue_scrape table on
every tick.

Co-Authored-By: mogery <mogery@sideguide.dev>
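The batched selection could look like the sketch below. Table and column names are inferred from the commit messages, not copied from nuq.sql, and the victim predicate is taken from the later commit in this same series:

```sql
-- Cap each cleaner run and skip groups another worker already holds,
-- so cascading deletes cannot outrun the schedule.
WITH victims AS (
  SELECT group_id
  FROM nuq.group_crawl
  WHERE status = 'completed' AND expires_at < now()
  LIMIT 10000
  FOR UPDATE SKIP LOCKED
)
DELETE FROM nuq.group_crawl gc
USING victims v
WHERE gc.group_id = v.group_id;
```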

* fix(nuq-postgres): schedule REINDEX for standalone partial indexes

Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): add plain group_id indexes and victim-selection index

EXPLAIN on the cascading DELETE in nuq_group_crawl_clean showed a seq scan
over 21.8M rows of queue_scrape -- ~7s for just 100 group_ids -- because
every existing (group_id, ...) index was filtered by mode='single_urls' or
status, so none covered DELETE WHERE group_id IN (...). queue_crawl_finished
had no group_id index at all. group_crawl had no index for the
status='completed' AND expires_at < now() victim selection (the existing
partial is on status='active', the opposite predicate).

Add three plain indexes plus matching REINDEX schedules.

Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): drop nuq_group_crawl_clean LIMIT to 2000

10000 was too aggressive given the variance in group sizes (p50=9 jobs/group
but max=8495). One outlier group landing in a batch was enough to blow past
the 4min statement_timeout even with the new group_id index. Steady state
arrival is ~1600 groups/tick, so 2000 leaves margin while keeping worst-case
row counts safe.

Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): cleaner runs every minute with LIMIT 500 and 90s timeout

Per-row delete cost on queue_scrape (12 indexes, scattered heap pages, ~86
rows/s measured) is the ceiling, not group selection. Bigger batches just
let one heavy outlier group blow the timeout. Smaller batches at faster
cadence get steady-state runs into ~10s and bound worst-case under the
tighter 90s timeout. If a tick still fails, the next minute retries instead
of holding a 4min transaction.

Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): drop unused completed-created-at index on queue_scrape

Mirror prod: nuq_queue_scrape_completed_created_at_idx had ~9k scans vs
650M+ on its peers (3 orders of magnitude less) and was 367 MB. Replaced
for the standalone cleaner by the new _standalone_ partial index. Dropped
on prod after confirming no caller; remove from schema and unschedule its
REINDEX cron.

Co-Authored-By: mogery <mogery@sideguide.dev>

---------

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: mogery <mogery@sideguide.dev>
Route audio scrapes through Chrome CDP so YouTube cookies can be collected and forwarded to avgrab for authenticated yt-dlp downloads.

Made-with: Cursor
* feat(search): includeDomains/excludeDomains

* bump sdks

---------

Co-authored-by: Nicolas <20311743+nickscamara@users.noreply.github.com>
…irecrawl#3473)

When the only requested format is audio, send skipYouTubeTranscript:true
to fire-engine so the chrome-cdp worker doesn't inject the transcript
extraction script on YouTube watch URLs. Audio-only callers don't
consume the transcript markdown, and the script's button click triggers
a YouTube SPA main-frame swap that's currently producing widespread CDP
errors and 30s+ tail latency on YouTube scrapes.

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: mogery <mogery@sideguide.dev>
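The audio-only decision can be sketched as below. The `skipYouTubeTranscript` flag name comes from the commit message; the request shape and function name are assumptions:

```typescript
// Sketch: only set skipYouTubeTranscript when audio is the sole
// requested format, so transcript extraction still runs otherwise.
interface FireEngineRequest {
  url: string;
  skipYouTubeTranscript?: boolean;
}

function buildRequest(url: string, formats: string[]): FireEngineRequest {
  const audioOnly = formats.length === 1 && formats[0] === "audio";
  return audioOnly ? { url, skipYouTubeTranscript: true } : { url };
}
```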
…irecrawl#3474)

zerolog was configured with TimeFormatUnix, which writes integer Unix
seconds. Cloud Logging then ingests every entry with second-level
resolution, so high-throughput log lines from this service collapse into
a single timestamp bucket and sort unpredictably against other services'
logs (which use RFC3339 with sub-second precision). Switching to
RFC3339Nano preserves the original event ordering end-to-end.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@agenticassets agenticassets merged commit a48d105 into main May 5, 2026
0 of 10 checks passed