diff --git a/examples/proposals/headless_crawler_integration.md b/examples/proposals/headless_crawler_integration.md
new file mode 100644
index 000000000..9691da28d
--- /dev/null
+++ b/examples/proposals/headless_crawler_integration.md
@@ -0,0 +1,477 @@
+# RFC: Headless Crawler Integration for PentAGI
+
+## Summary
+
+Issue [#336](https://github.com/vxcontrol/pentagi/issues/336) asks for
+a headless crawler / URL discovery capability. The current web testing
+flow can brute-force directories with dictionary tools such as `ffuf`,
+but it cannot crawl an application to discover its real routes, links,
+forms, parameters, and JavaScript endpoints. Dictionary fuzzing guesses
+paths from a wordlist; crawling observes the paths the application
+actually exposes. The two are complementary, and PentAGI currently has
+no first-class, scoped, structured crawler capability.
+
+This RFC is a design proposal only. It does not add runtime code, a new
+tool handler, a crawler client, GraphQL schema, database migrations,
+frontend UI, installer logic, provider configuration, Docker image
+changes, Docker Compose changes, `.env.example` entries, or generated
+files. It proposes how PentAGI could expose crawler-style URL discovery
+as an optional capability so maintainers can review scope, safety, and
+the artifact model before any implementation lands.
+
+The proposed direction is a tool-agnostic "crawler backend" (or
+"discovery tool") abstraction over candidate tools such as `katana`,
+`crawlergo`, `rad`, and `jsfinder`, disabled by default, producing
+normalized discovery artifacts (URLs, forms, parameters, JavaScript
+endpoints, status codes, source page, depth, and scope decision) that
+other PentAGI tools, agents, and reports can consume.
+
+This work is deliberately kept separate from the BrowserOS MCP browser
+backend proposed in issue
+[#342](https://github.com/vxcontrol/pentagi/issues/342).
+That proposal
+targets stateful, interactive browser control (navigation, click,
+fill/type, form submission, login-page handling). This RFC targets
+URL/route/form/link discovery and passive or semi-passive site mapping,
+not interactive session automation. The two can complement each other,
+and JavaScript-heavy crawling is the natural overlap point, but their
+contracts are different: a crawler returns a discovery artifact, while
+an interactive browser drives a session.
+
+## Goals
+
+- Add a reviewable design for an optional crawler / URL discovery
+  capability before any runtime work begins.
+- Treat the crawler as a tool-agnostic backend with one or more
+  selectable discovery tools, not a single hardcoded CLI.
+- Produce normalized, structured discovery artifacts that capture URLs,
+  HTTP methods, status codes, forms, parameters, JavaScript endpoints,
+  the source page, crawl depth, and the in-scope/out-of-scope decision.
+- Let PentAGI agents use crawler output to seed `ffuf`/`dirsearch`-style
+  content discovery, guide browser checks, reduce repeated manual
+  enumeration, and enrich reports.
+- Keep the capability secure by default: disabled unless enabled, scoped
+  to the authorized target, rate limited, and bounded by depth, page,
+  and request limits.
+- Reuse PentAGI's existing URL classification and screenshot/evidence
+  paths where practical instead of inventing parallel infrastructure.
+- Keep crawler discovery clearly separated from the BrowserOS MCP
+  interactive browser backend.
+
+## Non-Goals
+
+- This RFC does not implement crawler runtime code, a crawler tool
+  handler, or a crawler client.
+- This RFC does not add Docker image changes, Docker Compose services,
+  installer behavior, `.env.example` entries, new environment variables,
+  GraphQL types, REST endpoints, database tables, generated files,
+  provider configuration, or frontend settings.
+- This RFC does not replace `ffuf`, `dirsearch`, or any existing content
+  discovery workflow.
+  Crawling and dictionary fuzzing are complementary.
+- This RFC does not provide stateful interactive browser automation such
+  as clicking through UI, typing into fields, or driving multi-step
+  login flows. That direction belongs to the BrowserOS MCP browser
+  backend in issue #342.
+- This RFC does not pick a mandatory default crawler tool. `katana`,
+  `crawlergo`, `rad`, and `jsfinder` are framed as candidates.
+- This RFC does not propose credentialed crawling or automatic active
+  form submission as default behavior.
+- This RFC does not choose the final storage shape for crawler
+  configuration or artifacts.
+
+## Current Browser and Discovery Behavior
+
+PentAGI already has two relevant building blocks, but neither performs
+structured crawling.
+
+The browser tool in `backend/pkg/tools/browser.go` is scraper-backed and
+request/response oriented. It operates on a single URL at a time and
+supports the `markdown`, `html`, and `links` actions, with an optional
+screenshot per call through the configured scraper. `SCRAPER_PUBLIC_URL`
+and `SCRAPER_PRIVATE_URL` select public or private scraper routing based
+on whether the target resolves to a private or public address, and the
+tool already classifies local zones and binary URLs to avoid fetching
+non-HTML resources. The `links` action returns the links found on one
+fetched page, but the browser tool does not recursively follow links,
+build a site map, deduplicate routes, extract forms or parameters,
+discover endpoints referenced only from JavaScript, or enforce a crawl
+scope or budget. It is single-page extraction, not crawling.
+
+The terminal tool lets the pentester agent run arbitrary command-line
+tools inside the isolated pentest container.
+The pentester prompt's web
+testing guidance already lists crawler-adjacent tools (for example
+`ffuf`, `gobuster`, `dirsearch`, `feroxbuster`, `httpx`, `katana`,
+`hakrawler`, `waybackurls`, and `gau`) as examples the agent may use,
+with a reminder to verify availability and install missing tools in the
+current image. This means a crawler such as `katana` can in principle be
+invoked today, but only as an ad-hoc terminal command. Its output is
+unstructured stdout, it is not scoped or budgeted by PentAGI, the
+results are not normalized into a reusable artifact, and nothing feeds
+the discovered routes back into `ffuf`, the browser tool, or the report.
+
+The gap from issue #336 is therefore not "no tool can crawl" but "there
+is no first-class, scoped, structured crawler capability." Discovery
+today depends on whichever CLI the agent happens to run, returns
+free-form text, and cannot be reliably reused across subtasks.
+
+## Proposed Crawler Capability
+
+A future implementation can introduce a crawler / URL discovery
+capability that is conceptually parallel to the existing browser tool:
+it wraps a discovery backend, accepts one or more seed URLs plus a
+scope, and returns a normalized discovery artifact instead of free-form
+text.
+
+Proposed shape of the capability:
+
+- Input: one or more seed URLs, an explicit scope (allowed hosts or
+  scope entries), a crawl mode, and budget limits (maximum depth, pages,
+  requests, duration, and request rate).
+- Crawl mode candidates:
+  - `passive`: derive URLs from already-collected sources such as link
+    extraction, archived URL feeds, or saved responses, without sending
+    new requests to the target beyond what is already authorized.
+  - `static`: request and parse HTML and linked resources without a
+    full browser engine.
+  - `headless`: render pages with a headless browser engine so that
+    JavaScript-built routes, single-page-application views, and
+    dynamically generated links can be discovered.
+- Output: a structured discovery artifact (see Artifacts and Reporting)
+  that records each discovered URL with its method, status code, source
+  page, depth, parameters, forms, JavaScript endpoints, the backend that
+  found it, and the scope decision.
+- Agent-facing result: a concise text summary for the agent (counts,
+  notable routes, limits reached) plus the structured artifact stored
+  through PentAGI's existing artifact and memory paths, mirroring how the
+  browser tool returns text while saving screenshots.
+
+The capability is observation-focused. It maps what exists; it does not
+log in, submit credentials, or perform stateful multi-step interaction.
+Where a target genuinely requires JavaScript rendering or authenticated
+interaction to reveal routes, that overlaps with the BrowserOS MCP
+backend (issue #342); this RFC treats such cases as a handoff boundary,
+not as crawler scope creep.
+
+## Candidate Tools
+
+The tools named in issue #336 cover complementary discovery styles. They
+are candidates for a pluggable backend, not mandated defaults.
+
+- `katana` (ProjectDiscovery): fast crawler with both a standard and a
+  headless mode, configurable depth and scope, form and endpoint
+  awareness, and structured (JSON/JSONL) output. It is already named in
+  the pentester prompt's web testing list, which makes it a natural
+  first candidate for a structured backend.
+- `crawlergo`: a Chromium-based dynamic crawler aimed at
+  JavaScript-heavy applications. It exercises the DOM to surface routes
+  and requests that static crawling misses and emits a request list that
+  can seed further testing.
+- `rad`: a Chromium-based crawler focused on browser-driven crawling
+  with scope controls, useful where rendering is required to enumerate
+  application routes.
+- `jsfinder`: an endpoint/URL extractor that mines JavaScript files for
+  references to API paths and routes.
+  It is semi-passive and complements
+  crawlers by recovering endpoints that are present in scripts but never
+  linked from rendered pages.
+
+Adjacent tools already referenced by the pentester prompt could sit
+behind the same abstraction: `waybackurls` and `gau` as passive archive
+sources, and `hakrawler` as a lightweight active crawler. The RFC does
+not require adopting all of them; it requires that the abstraction not
+be hardwired to one CLI.
+
+## Tool Selection Model
+
+The capability should expose a `crawler backend` (equivalently a
+`discovery tool`) abstraction so that the agent-facing tool surface is
+stable while the underlying CLI can vary.
+
+Proposed selection model:
+
+- Default is off. No crawler backend is selected and no crawl runs
+  unless an operator explicitly enables at least one backend.
+- Operators register which discovery tools are available and enabled in
+  the deployment, since tool availability depends on the pentest image.
+- Each backend declares the crawl modes it can serve (`passive`,
+  `static`, `headless`) and any tool-specific limits.
+- The agent (or a future planner step) selects an enabled backend and
+  mode appropriate to the subtask. If a requested mode has no enabled
+  backend, the capability reports that clearly instead of silently
+  falling back to an unsafe or out-of-scope behavior.
+- All backends normalize into the same discovery artifact schema so that
+  downstream consumers do not depend on which CLI produced the result.
+
+No single tool is promoted to a required default. `katana` is a
+reasonable first candidate because it already appears in PentAGI's tool
+guidance and supports structured output, but the RFC frames it as one
+option among several, not a mandate.
+
+## Agent Workflow
+
+A structured crawler changes how the pentester agent approaches a web
+target. Today the agent fuzzes directories and may run an ad-hoc crawler
+whose text output is hard to reuse.
+With a structured capability, a web
+subtask can run discovery once, persist the artifact, and let later
+subtasks consume it.
+
+Illustrative workflow:
+
+1. Early in a web engagement, the pentester agent runs crawler discovery
+   against the authorized target within scope and budget limits.
+2. The normalized artifact is stored through PentAGI's artifact and
+   long-term memory paths so it is reusable across subtasks rather than
+   re-derived.
+3. Discovered directories, file extensions, and parameter names seed
+   `ffuf`/`dirsearch`-style content discovery, making dictionary fuzzing
+   more targeted instead of purely wordlist-driven.
+4. Interesting routes (login, upload, admin, API, parameterized
+   endpoints) guide focused browser checks, and, where interactive
+   rendering or login is required, hand off to the BrowserOS MCP backend
+   (issue #342).
+5. Parameterized URLs and JavaScript endpoints inform vulnerability
+   testing tools the agent already uses (for example `nuclei` or
+   `sqlmap`) by giving them concrete inputs.
+6. The final report is enriched with a route/endpoint inventory derived
+   from the artifact, reducing repeated manual enumeration and giving
+   reviewers a clear map of the attack surface.
+
+This fits PentAGI's existing flow/task/subtask model and result tools
+(for example `hack_result` and `report_result`) and its RAG memory,
+without introducing hidden lifecycle state. The crawl is an explicit
+tool call with a visible, inspectable artifact.
+
+## Artifacts and Reporting
+
+Crawler output should become a normalized artifact rather than raw
+stdout. The schema below is illustrative, not a final contract.
+
+Per-entry fields:
+
+- `url`: the discovered URL.
+- `method`: HTTP method observed or inferred (for example GET or POST).
+- `status_code`: response status when the crawler requested the URL.
+- `source_page`: the page the URL was discovered from.
+- `depth`: crawl depth at which the URL was found.
+- `scope_decision`: whether the URL was treated as in scope or out of
+  scope, and why, modeled as a `decision` value plus a short `reason`.
+- `parameters`: query or body parameter names associated with the URL.
+- `forms`: form action, method, and input field names.
+- `js_endpoints`: endpoints referenced from JavaScript for this page.
+- `discovered_by`: which backend (for example `katana` or `jsfinder`)
+  produced the entry.
+
+An illustrative entry:
+
+```json
+{
+  "url": "https://target.example/admin/login",
+  "method": "GET",
+  "status_code": 200,
+  "source_page": "https://target.example/",
+  "depth": 1,
+  "scope_decision": { "decision": "in_scope", "reason": "same_origin" },
+  "parameters": ["redirect", "lang"],
+  "forms": [
+    {
+      "action": "https://target.example/admin/login",
+      "method": "POST",
+      "inputs": ["username", "password", "csrf_token"]
+    }
+  ],
+  "js_endpoints": ["https://target.example/api/v1/session"],
+  "discovered_by": "katana"
+}
+```
+
+A run-level summary should accompany the entries: seed URLs, scope
+configuration, backend and mode used, counts (pages crawled, URLs,
+forms, parameters, JavaScript endpoints), which limits were reached, and
+how many entries were dropped as out of scope. Artifacts should be
+stored through the same flow-scoped artifact, screenshot, and reporting
+paths the browser tool already uses where practical. If the evidence
+chain proposal in
+[evidence_chain.md](evidence_chain.md) is implemented later, a crawl run
+and its artifact are a natural source of toolcall receipts.
+
+## Scope and Safety
+
+Crawling reaches more of a target than single-page scraping, so the
+capability must be secure by default and explicitly scoped.
+
+Required safety properties:
+
+- Disabled by default and explicit when used. No crawl runs unless an
+  operator enables a backend and the agent invokes it deliberately.
+- Obey flow target scope. Crawling must respect the authorized target
+  scope of the flow and must not wander to unrelated hosts.
+- Depth, page, and request limits. Every crawl is bounded by maximum
+  depth, maximum pages, maximum total requests, and maximum duration.
+- Rate limiting. Request rate and concurrency are capped to avoid
+  hammering the target or tripping protective controls.
+- Same-origin and allowed-host controls. Off-origin and off-allowlist
+  links are recorded as out of scope by default rather than followed.
+- SSRF and private-network protection. The capability should reuse the
+  browser tool's URL classification so link-local, loopback, metadata,
+  private, and reserved targets are not crawled unless explicitly
+  authorized for the engagement.
+- `robots.txt` is treated as an operator policy question, not a hard
+  rule. Authorized engagements differ on whether to honor `robots.txt`,
+  so the policy (honor, ignore, or record only) should be configurable
+  and recorded in the artifact, and the default should be conservative.
+  This RFC does not mandate either honoring or ignoring it.
+- Avoid crawling outside authorized targets. Out-of-scope discovery is
+  recorded for awareness but not actively requested.
+- No credentialed crawling by default. The crawler should not assume it
+  can use stored credentials, cookies, or authenticated sessions.
+- Active form submission requires separate approval or policy. Passive
+  discovery of forms is in scope; submitting forms (which mutates target
+  state) is a separate, gated capability and is not default behavior.
+
+These properties keep the capability aligned with PentAGI's lawful
+pentesting posture: it maps the authorized attack surface; it does not
+broaden scope, automate accounts, or take state-changing actions without
+an explicit decision.
+
+## Configuration Sketch
+
+This sketch is illustrative only. It is not a proposed `.env.example`
+change and does not choose the final storage shape.
+
+```yaml
+crawler:
+  enabled: false                 # off by default; operator opt-in
+  default_backend: none          # default crawl backend; values: none, katana,
+                                 # crawlergo, rad. Passive extractors such as
+                                 # jsfinder are not selectable as the default.
+  backends:
+    katana:
+      enabled: false
+      modes: [static, headless]  # advertised crawl modes
+    crawlergo:
+      enabled: false
+      modes: [headless]
+    rad:
+      enabled: false
+      modes: [headless]
+    jsfinder:
+      enabled: false
+      modes: [passive]           # JavaScript endpoint extraction
+  scope:
+    follow: same_origin          # same_origin | allowlist | scope_entries
+    allowed_hosts:               # used when follow=allowlist
+      - target.example
+    scope_entries:               # used when follow=scope_entries; illustrative flow
+      - https://target.example   # scope-of-work entries, ideally sourced
+      - target.example/app       # from the flow target scope if available
+    robots_policy: record_only   # honor | ignore | record_only
+  limits:
+    max_depth: 3
+    max_pages: 500
+    max_requests: 2000
+    max_duration_seconds: 600
+    requests_per_second: 5
+    max_concurrency: 5
+  credentials:
+    allow_credentialed_crawl: false   # no authenticated crawling by default
+  active:
+    allow_form_submission: false      # passive discovery only by default
+    require_approval_for_submission: true
+```
+
+Implementation notes for a future PR:
+
+- The crawler backend should be selectable so the agent-facing tool
+  surface stays stable while the underlying CLI can change.
+- Tool identifiers and flags here are illustrative. A real
+  implementation must validate the canonical tool invocation and output
+  format for each backend before normalizing it.
+- Scope and SSRF checks should integrate with PentAGI's existing URL
+  classification rather than re-implementing target analysis.
+- The final storage mechanism (mounted YAML, database-backed settings,
+  existing tool configuration, or another maintainer-approved shape) is
+  intentionally left open.
+
+## Failure Modes
+
+A crawler capability should degrade safely and visibly.
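One way to picture "degrade safely and visibly" is a budget guard that stops at the first exhausted limit and names that limit in the run summary. The sketch below is a minimal illustration in Python, not PentAGI code: `CrawlBudget`, `crawl_with_budget`, and the `fetch_links` callback are hypothetical names, and the limit fields simply mirror the `limits` keys from the configuration sketch.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class CrawlBudget:
    # Field names mirror the illustrative `limits` block; values are examples.
    max_depth: int = 3
    max_pages: int = 500
    max_requests: int = 2000


def crawl_with_budget(
    seeds: Iterable[str],
    fetch_links: Callable[[str], List[str]],  # hypothetical backend callback
    budget: CrawlBudget,
) -> Dict:
    """Breadth-first discovery that returns a partial result when a limit is hit."""
    queue = deque()
    seen = set()
    for url in seeds:
        if url not in seen:
            seen.add(url)
            queue.append((url, 0))

    entries: List[Dict] = []
    requests = 0
    limits_reached: List[str] = []

    while queue:
        # Check budgets before every request so the crawl never overshoots.
        if len(entries) >= budget.max_pages:
            limits_reached.append("max_pages")
            break
        if requests >= budget.max_requests:
            limits_reached.append("max_requests")
            break

        url, depth = queue.popleft()
        requests += 1
        try:
            links = fetch_links(url)
        except Exception as exc:
            # Record the failure and keep going instead of aborting the run.
            entries.append({"url": url, "depth": depth, "error": str(exc)})
            continue

        entries.append({"url": url, "depth": depth})
        for link in links:
            if link in seen:
                continue
            seen.add(link)
            if depth + 1 > budget.max_depth:
                # Too deep to follow; note the limit once and move on.
                if "max_depth" not in limits_reached:
                    limits_reached.append("max_depth")
                continue
            queue.append((link, depth + 1))

    return {"entries": entries, "requests": requests, "limits_reached": limits_reached}
```

The caller always gets back whatever was collected plus the list of limits that stopped further discovery, which is the information the run-level summary is meant to carry. Duration and rate limits are left out to keep the sketch short.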
+
+Expected failure modes:
+
+- The selected backend is not installed or not available in the current
+  pentest image.
+- The target is unreachable, blocks the crawler, or rate-limits it.
+- A JavaScript-heavy site yields little in `static` mode and needs a
+  `headless` backend that is not enabled.
+- The crawl hits a trap such as an infinite calendar, faceted search, or
+  session-id-in-URL pattern and approaches its budget without progress.
+- A link points outside the authorized scope.
+- The backend emits a very large result set beyond configured limits.
+- The backend returns malformed or unexpected output that cannot be
+  normalized.
+
+Recommended behavior:
+
+- Return a clear discovery tool error or partial result to the agent and
+  continue the agent loop when possible.
+- Record the backend, mode, target, scope decision, and which limit was
+  reached.
+- Fail closed for scope, allowlist, SSRF, and approval decisions rather
+  than retrying with a broader behavior.
+- Prefer returning a bounded partial artifact over hanging or crawling
+  past configured limits.
+- Do not retry in a tight loop; rely on bounded budgets and, if retry is
+  added later, on backoff.
+
+## Open Questions
+
+- Should the crawler be exposed as a new first-class tool, as an
+  extension of the existing browser tool, or wrapped around terminal
+  execution of the chosen CLI?
+- Which backend(s) should be vendored into the pentest image by default,
+  given that tool availability drives what the agent can actually run?
+- Should `headless` crawling reuse a browser engine shared with the
+  BrowserOS MCP backend (issue #342), or run as a standalone crawler
+  process?
+- Where should crawler configuration and artifacts live so operators can
+  inspect scope and results without leaking target details?
+- How should the discovery artifact integrate with the report,
+  long-term memory, and the evidence chain proposal?
+- What default rate limits, depth, and page budgets are safe for
+  authorized engagements without being so low that discovery is useless?
+- Should adjacent discovery sources (`waybackurls`, `gau`, `hakrawler`) be
+  modeled behind the same abstraction or kept as separate tools?
+- What is the right approval model and UX for the optional active
+  form-submission capability?
+- How should `robots.txt` policy default be chosen, and should it be set
+  per deployment, per flow, or per crawl?
+
+## Incremental Milestones
+
+1. Docs-only RFC.
+   - Land this proposal so maintainers can review scope, safety, the
+     backend abstraction, and the artifact model before runtime work
+     begins.
+
+2. Artifact schema and single-backend prototype.
+   - Define the normalized discovery artifact and run summary.
+   - Wire one backend (for example `katana` in `static` mode) behind the
+     abstraction, scoped and budget-limited, without exposing it to
+     agents yet.
+
+3. Scoped discovery tool for the pentester agent.
+   - Expose crawler discovery as an explicit, disabled-by-default tool
+     with scope, depth, page, request, rate, and duration limits.
+   - Persist the artifact through existing artifact and memory paths.
+
+4. Downstream integration.
+   - Use the artifact to seed `ffuf`/`dirsearch`, guide browser checks,
+     inform parameterized vulnerability testing, and enrich reports.
+
+5. Optional headless and semi-passive backends.
+   - Add `headless` backends (for example `crawlergo` or `rad`) and
+     JavaScript endpoint extraction (`jsfinder`) for JavaScript-heavy
+     targets.
+   - Keep `robots.txt` policy explicit, keep credentialed crawling and
+     active form submission off by default, and gate any state-changing
+     action behind approval.
+
+Refs #336
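As a closing illustration of milestone 4, the sketch below shows one way the illustrative artifact entries might be turned into seed wordlists for dictionary tools. It is a hedged Python example, not a proposed implementation: `seed_wordlists` is a hypothetical helper, and it assumes entries shaped like the illustrative JSON entry in Artifacts and Reporting.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def seed_wordlists(entries: Iterable[dict]) -> Dict[str, List[str]]:
    """Derive path-segment and parameter-name wordlists from in-scope entries."""
    paths = set()
    params = set()
    for entry in entries:
        # Only in-scope discoveries may seed further active testing.
        if entry.get("scope_decision", {}).get("decision") != "in_scope":
            continue
        for segment in urlsplit(entry["url"]).path.split("/"):
            if segment:
                paths.add(segment)
        params.update(entry.get("parameters", []))
        for form in entry.get("forms", []):
            params.update(form.get("inputs", []))
    # Sorted output keeps the wordlists stable across runs.
    return {"paths": sorted(paths), "parameters": sorted(params)}
```

The resulting lists could be written to files and passed to `ffuf` through its `-w` wordlist option. JavaScript endpoints are deliberately not folded in here, since they would need their own scope decision first.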