chore: sync upstream firecrawl/main (2026-04-27) #5
Merged
Conversation
* Nick: Update scrape-browser.ts
…l#3223)

* Update scrape-browser.ts
* Update apps/api/src/controllers/v2/scrape-browser.ts

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
…l#3224)

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: micahstairs <micah@sideguide.dev>
* feat: audio format
* rust formatting
* address cubic feedback
* address cubic feedback
* fix billing order
* feat(avgrab): get site support from service directly
* fix
* Update apps/api/src/scraper/scrapeURL/transformers/audio.ts

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>
…l#3228)

The rotate-api-key and agent-signup-block controllers delete API keys from the database but do not invalidate the ACUC Redis cache (600s TTL), allowing revoked keys to authenticate for up to 10 minutes. Add clearACUC() calls after key deletion in both controllers, matching the pattern already used in the dashboard revoke action and the agentSignupConfirmController.

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: micahstairs <micah@sideguide.dev>
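The failure mode described above is generic to any TTL-cached auth path; a minimal Python sketch (names and structure hypothetical, not Firecrawl's actual code) shows why deleting the key row alone is not enough, and why the explicit cache invalidation matters:

```python
import time

class TTLCache:
    """Minimal TTL cache standing in for the ACUC Redis cache (600s TTL)."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expiry)

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]
        return None

    def set(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)

    def delete(self, key):
        self.store.pop(key, None)

db_keys = {"fc-abc123"}      # stand-in for the API key table
acuc_cache = TTLCache(600)

def authenticate(api_key):
    cached = acuc_cache.get(api_key)
    if cached is not None:
        return cached                 # cache hit: the database is never consulted
    ok = api_key in db_keys
    acuc_cache.set(api_key, ok)
    return ok

authenticate("fc-abc123")             # warms the cache with a positive result
db_keys.discard("fc-abc123")          # revoke: the DB row is gone...
stale = authenticate("fc-abc123")     # ...but the cache still says True
acuc_cache.delete("fc-abc123")        # the clearACUC()-style fix
fresh = authenticate("fc-abc123")     # now re-checked against the DB
```

Without the `delete()` call, `stale` stays truthy until the 600s TTL expires, which is exactly the 10-minute window the commit closes.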
Upgrade autumn-js from 1.0.0-beta.10 to 1.1.6 (latest stable).

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: micahstairs <micah@sideguide.dev>
* fix(deps): override vulnerable transitive dependencies

  Add pnpm overrides for picomatch (<4.0.2), yaml (<2.7.1), and smol-toml (<1.3.2) across all affected packages to resolve CI security audit failures.

* fix(deps): correct override versions for picomatch, yaml, smol-toml

  picomatch >=4.0.4, yaml >=2.8.3, smol-toml >=1.6.1
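For reference, pnpm overrides of this kind live under the `pnpm.overrides` key of the workspace-root package.json. A hedged sketch of the corrected versions (exact file and placement in this repo may differ):

```json
{
  "pnpm": {
    "overrides": {
      "picomatch": ">=4.0.4",
      "yaml": ">=2.8.3",
      "smol-toml": ">=1.6.1"
    }
  }
}
```

Overrides apply to every transitive occurrence of the named package in the workspace, which is why a single root-level entry can clear an audit failure coming from several dependents at once.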
* chore(deps): upgrade pdf-inspector to 44092bc
* chore(deps): upgrade pdf-inspector to 4d52d7a
Update Feature Overview table and dedicated section to reflect the rename from Browse to Interact, with updated code examples showing the scrape-then-interact workflow, prompt-based interaction, code execution, and persistent profiles.
docs(readme): replace Browse with Interact section
…recrawl#3239) Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com> Co-authored-by: micahstairs <micah@sideguide.dev> Co-authored-by: Micah Stairs <micah.stairs@gmail.com>
…m/firecrawl/firecrawl into New-readme-scrape-search-interact
… lifecycle methods

- Empty/blank API keys explicitly passed via builder.apiKey() now throw immediately instead of silently falling back to environment variables
- Add missing async methods: startCrawlAsync, getCrawlStatusAsync, cancelCrawlAsync, startBatchScrapeAsync, getBatchScrapeStatusAsync, cancelBatchScrapeAsync, startAgentAsync, getAgentStatusAsync, cancelAgentAsync, getConcurrencyAsync, getCreditUsageAsync

Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
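The empty-key rule is easy to get wrong in any SDK that also reads the key from the environment; a hedged Python sketch of the intended precedence (function name hypothetical, not the Java SDK's actual code):

```python
import os

def resolve_api_key(explicit_key=None):
    """An explicitly passed key always wins; an explicitly passed *empty*
    key is an error, not a trigger for the environment-variable fallback."""
    if explicit_key is not None:
        if not explicit_key.strip():
            raise ValueError("API key must not be empty or blank")
        return explicit_key
    env_key = os.environ.get("FIRECRAWL_API_KEY")
    if env_key:
        return env_key
    raise ValueError("no API key provided")
```

The buggy behaviour the commit fixes is treating `""` like `None` and falling through to the environment, which silently authenticates with a key the caller never intended to use.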
…nc crawl

- Add limit > 0 validation to sync and async map() methods
- Add limit > 0 validation to async crawl _prepare_crawl_request() (sync version already had this validation)
- Negative or zero limits now raise ValueError consistently across all methods: search, crawl, and map

Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
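A minimal sketch of the consistent check described above (helper name hypothetical, not the SDK's actual code):

```python
def validate_limit(limit):
    """Reject non-positive limits client-side, before any API call is made.
    Applied uniformly wherever a limit parameter is accepted (search,
    crawl, map), so callers get the same ValueError in every code path."""
    if limit is not None and limit <= 0:
        raise ValueError(f"limit must be positive, got {limit}")
    return limit
```

Failing fast in the client means a bad limit surfaces as an immediate, descriptive exception instead of an opaque API error (or silently odd behaviour) after a network round trip.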
…anging

Timeout paths in both WebSocket and pollLoop were calling close() without emitting an error event, which would cause the start() Promise to hang indefinitely. Now emits an error event before closing so the Promise properly resolves.

Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
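This is the classic "closed without settling the promise" bug. An asyncio analogue (watcher structure hypothetical; the real fix is in the JS SDK) of settling the future before cleanup:

```python
import asyncio

async def watch(timeout_s, job_done):
    """Toy watcher: resolves when the job finishes, errors on timeout.
    The key step is set_exception() *before* any close/cleanup -- closing
    without settling the future is what made the real start() hang."""
    try:
        return await asyncio.wait_for(asyncio.shield(job_done), timeout_s)
    except asyncio.TimeoutError:
        if not job_done.done():
            # Emit the error (settle the future) first...
            job_done.set_exception(TimeoutError("watcher timed out"))
        # ...then cleanup (close sockets etc.) would happen here.
        return await job_done  # re-raises the recorded error; never hangs

async def main():
    loop = asyncio.get_running_loop()
    never_done = loop.create_future()  # a job that never completes
    try:
        await watch(0.01, never_done)
        return "resolved"
    except TimeoutError:
        return "errored"

result = asyncio.run(main())
```

In the buggy version the `except` branch would just close and return, leaving `job_done` forever pending, so any caller awaiting it blocks indefinitely.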
StartCrawl, Map, and Search now reject non-positive limit values with an error instead of passing them through to the API. Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
…hange) Co-Authored-By: gaurav <gauravchadha1676@gmail.com>
…-go-sdk-qa-issues fix(go-sdk): add client-side validation for negative limit values
…-python-sdk-qa-issues fix(python-sdk): add missing negative limit validation in map and async crawl
…-java-sdk-qa-issues fix(java-sdk): reject explicitly empty API keys and add missing async lifecycle methods
…-nodejs-sdk-watcher-issues fix(js-sdk): fix watcher duplicate events, start() Promise resolution, and done event fields
…-ruby-sdk-qa-issues fix(ruby-sdk): unwrap credit_usage data field and default skipTlsVerification to false
…_timeout (firecrawl#3467)

* fix(nuq-postgres): spread reindexes per-index, bound maintenance with statement_timeout

  The single daily REINDEX TABLE CONCURRENTLY on nuq.queue_scrape (and queue_crawl_finished) at 09:00 UTC could run for 8+ hours on the live queue, holding ShareUpdateExclusiveLock and blocking autovacuum the whole time. With autovacuum stalled, queue_scrape accumulated 23% dead tuples, the visibility map went stale (index-only scans were doing 5x heap fetches), and worker pool starvation surfaced as multi-second nuqHealthCheck (SELECT 1), nuqRenewLock and nuqGetJobToProcess latencies.

  Changes:
  - Replace the two whole-table REINDEX crons with per-index REINDEX INDEX CONCURRENTLY jobs spread across 02:00-08:20 UTC, 20 min apart. A per-index reindex takes its lock only on the target index, and the staggered cadence prevents any one job from stacking against the next.
  - Each reindex cron sets statement_timeout = 20min, so a stuck job self-aborts instead of silently sitting for hours.
  - Add statement_timeout to the two cleanup crons we observed hanging in production: nuq_queue_scrape_backlog_reaper (50s) and nuq_group_crawl_clean (4min).
  - Add nuq_queue_scrape_backlog_times_out_at_idx so the per-minute backlog reaper does an index range scan instead of a seq scan over the whole backlog (~3M rows).

  Operator note: nuq.sql only runs at initdb. Apply this change to the running primary manually:

    CREATE INDEX CONCURRENTLY IF NOT EXISTS nuq_queue_scrape_backlog_times_out_at_idx ON nuq.queue_scrape_backlog (times_out_at);
    SELECT cron.unschedule('nuq_queue_scrape_reindex');
    SELECT cron.unschedule('nuq_queue_crawl_finished_reindex');
    -- then run the new SELECT cron.schedule(...) lines from this file.
  Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): prune cron.job_run_details hourly

  pg_cron does not auto-trim cron.job_run_details and the table has no default index on start_time, so with sub-minute jobs it grows unbounded and any time-scoped query against it ends up seq-scanning the whole history. Keep the last 24h.

  Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): make reindex crons single-statement, add watchdog

  REINDEX CONCURRENTLY cannot run inside a transaction block, and pg_cron wraps multi-statement command bodies in an implicit transaction. So the inline `SET statement_timeout = '20min'; REINDEX INDEX CONCURRENTLY ...` form fails with `REINDEX CONCURRENTLY cannot run inside a transaction block`. Drop the inline SET and add a 5-minute watchdog cron that cancels any nuq reindex running longer than 25 minutes. The watchdog provides the same self-aborting safety the inline timeout was meant to give, without the tx-block restriction.

  Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): tighten reindex watchdog to actually cap < slot cadence

  The previous watchdog (every 5 min, threshold 25 min) could cancel a stuck reindex as late as ~30 min in, overlapping the next 20-min slot. Run the watchdog every minute with an 18-min threshold so the worst-case runtime is ~19 min, strictly under the 20-min cadence.

  Co-Authored-By: mogery <mogery@sideguide.dev>

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: mogery <mogery@sideguide.dev>
…exes (firecrawl#3468)

* fix(nuq-postgres): batch group cleanup and add predicate-matching indexes

  Cap nuq_group_crawl_clean at 10000 groups per run with FOR UPDATE SKIP LOCKED so the cascading deletes can't outrun the 5min schedule. Add partial indexes that match the standalone (group_id IS NULL) cleaners' predicates so they stop seq-scanning the 18M-row queue_scrape table on every tick.

  Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): schedule REINDEX for standalone partial indexes

  Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): add plain group_id indexes and victim-selection index

  EXPLAIN on the cascading DELETE in nuq_group_crawl_clean showed a seq scan over 21.8M rows of queue_scrape -- ~7s for just 100 group_ids -- because every existing (group_id, ...) index was filtered by mode='single_urls' or status, so none covered DELETE WHERE group_id IN (...). queue_crawl_finished had no group_id index at all. group_crawl had no index for the status='completed' AND expires_at < now() victim selection (the existing partial is on status='active', the opposite predicate). Add three plain indexes plus matching REINDEX schedules.

  Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): drop nuq_group_crawl_clean LIMIT to 2000

  10000 was too aggressive given the variance in group sizes (p50=9 jobs/group but max=8495). One outlier group landing in a batch was enough to blow past the 4min statement_timeout even with the new group_id index. Steady-state arrival is ~1600 groups/tick, so 2000 leaves margin while keeping worst-case row counts safe.

  Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): cleaner runs every minute with LIMIT 500 and 90s timeout

  Per-row delete cost on queue_scrape (12 indexes, scattered heap pages, ~86 rows/s measured) is the ceiling, not group selection. Bigger batches just let one heavy outlier group blow the timeout. Smaller batches at faster cadence get steady-state runs into ~10s and bound worst-case under the tighter 90s timeout. If a tick still fails, the next minute retries instead of holding a 4min transaction.

  Co-Authored-By: mogery <mogery@sideguide.dev>

* fix(nuq-postgres): drop unused completed-created-at index on queue_scrape

  Mirror prod: nuq_queue_scrape_completed_created_at_idx had ~9k scans vs 650M+ on its peers (3 orders of magnitude less) and was 367 MB. Replaced for the standalone cleaner by the new _standalone_ partial index. Dropped on prod after confirming no caller; remove from schema and unschedule its REINDEX cron.

  Co-Authored-By: mogery <mogery@sideguide.dev>

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: mogery <mogery@sideguide.dev>
Route audio scrapes through Chrome CDP so YouTube cookies can be collected and forwarded to avgrab for authenticated yt-dlp downloads. Made-with: Cursor
* feat(search): includeDomains/excludeDomains
* bump sdks

Co-authored-by: Nicolas <20311743+nickscamara@users.noreply.github.com>
…irecrawl#3473)

When the only requested format is audio, send skipYouTubeTranscript:true to fire-engine so the chrome-cdp worker doesn't inject the transcript-extraction script on YouTube watch URLs. Audio-only callers don't consume the transcript markdown, and the script's button click triggers a YouTube SPA main-frame swap that's currently producing widespread CDP errors and 30s+ tail latency on YouTube scrapes.

Co-authored-by: firecrawl-spring[bot] <254786068+firecrawl-spring[bot]@users.noreply.github.com>
Co-authored-by: mogery <mogery@sideguide.dev>
…irecrawl#3474)

zerolog was configured with TimeFormatUnix, which writes integer Unix seconds. Cloud Logging then ingests every entry with second-level resolution, so high-throughput log lines from this service collapse into a single timestamp bucket and sort unpredictably against other services' logs (which use RFC3339 with sub-second precision). Switching to RFC3339Nano preserves the original event ordering end-to-end.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
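The bucketing effect is easy to reproduce in any language; a Python sketch of second- vs sub-second-resolution formatting (zerolog itself is Go, so this is only an illustration of the ordering loss):

```python
from datetime import datetime, timezone

# Two log events 800 microseconds apart, inside the same wall-clock second.
t1 = datetime(2026, 4, 27, 12, 0, 0, 100, tzinfo=timezone.utc)
t2 = datetime(2026, 4, 27, 12, 0, 0, 900, tzinfo=timezone.utc)

# Unix-seconds formatting (the old TimeFormatUnix behaviour): both events
# land in the same one-second bucket, so their relative order is lost.
unix_1, unix_2 = int(t1.timestamp()), int(t2.timestamp())

# RFC3339-style timestamps with sub-second precision sort correctly even
# as plain strings, which is what keeps ordering intact end-to-end.
rfc_1, rfc_2 = t1.isoformat(), t2.isoformat()
```

Because RFC3339 timestamps are lexicographically sortable, any downstream system that compares them as strings recovers the original event order; integer seconds give it nothing to break the tie with.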
Weekly upstream sync refresh.
Summary
Merged upstream/main at 9015e7df1 into the existing integration branch chore/sync-upstream-2026-04-20. 21 fork-only commits vs 597 upstream-only commits on origin/main...upstream/main.

Notable upstream changes included in this refresh
- fix(nuq-postgres): reap orphaned NULL times_out_at backlog rows (firecrawl/firecrawl#3413)
- feat(docker): expose HARNESS_STARTUP_TIMEOUT_MS via docker-compose env (firecrawl/firecrawl#3447)
- fix(api/nuq): cap crawl backlog timeout at 48h instead of Infinity (firecrawl/firecrawl#3450)
- fix(api/cclog): log Supabase insert errors (firecrawl/firecrawl#3444)
- feat(js-sdk): explicit axios timeout (firecrawl/firecrawl#3440)
- fix(api/query): update direct quote model
- fix(pdf): raise Fire PDF cap to 30MB and compare raw bytes (firecrawl/firecrawl#3436)

Validation
- apps/js-sdk/firecrawl: pnpm test:unit ✅ (48 tests passed)
- apps/api: validation blocked locally by missing Rust toolchain / cargo; pnpm install fails while building native package @mendable/firecrawl-rs, and pnpm build then cannot resolve that package

Notes