site-mapper-agents

LLM-once API discovery + self-healing extraction for any browser-accessible portal.

Burst-record CDP network traffic from a portal you have a browser session on, hand it to a three-agent team, get back a typed schema + signatures you can extract from forever — with auto-repair when the portal's API shape drifts.

The problem

Every SaaS portal has a different API. Writing extractors for each is a treadmill — and the schemas change without warning, so your extractors silently break.

Pre-built connectors only cover the top 20 platforms. For everything else (internal CRMs, niche-vertical tools, undocumented partner portals) you either pay someone to reverse-engineer the API, or you give up and scrape the DOM.

This library is the third option.

What this solves

Onboarding: you point the system at a portal you have a real browser session on. It records a burst of CDP network traffic while you click around, then asks an LLM once to classify which endpoints carry the data you want and to map response JSON keys to your fields. Output is a typed SiteSchema and a list of NetworkSignature patterns.
Extraction: from that point forward, every CDP event is matched against the saved signatures with pure Pydantic validation — sub-millisecond, no LLM calls, no cost.
Self-healing: when the portal changes its response shape, an ExtractionFailed event fires. The Healer compares the old key map against the new response, fixes what it can deterministically, and asks the LLM to semantically match the rest. Confident patches auto-apply. Borderline patches surface for human review.
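
To make the extraction step concrete, here is a minimal stdlib-only sketch of what matching an event against a saved signature could look like. It is not the library's implementation (the text above says the real hot path uses Pydantic validation). The `Signature` class, the regex `url_pattern`, the dotted-path `key_map` format, and the `extract` function are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Any, Optional

# Sentinel distinguishing "key absent" from a legitimate None value.
_MISSING = object()


@dataclass
class Signature:
    """Hypothetical stand-in for a saved signature: URL regex + field -> dotted path."""
    url_pattern: str
    key_map: dict


def resolve(body: Any, dotted_path: str) -> Any:
    """Walk a dotted path through nested dicts; return _MISSING if any hop is absent."""
    current = body
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def extract(sig: Signature, url: str, body: Any) -> Optional[tuple]:
    """Return None when the URL does not match, else (payload, missing_fields)."""
    if re.fullmatch(sig.url_pattern, url) is None:
        return None
    payload, missing = {}, []
    for field_name, path in sig.key_map.items():
        value = resolve(body, path)
        if value is _MISSING:
            missing.append(field_name)
        else:
            payload[field_name] = value
    return payload, missing


sig = Signature(
    url_pattern=r"https://crm\.example\.com/api/v2/accounts/\d+",
    key_map={"account_id": "data.client.id", "email": "data.client.email"},
)
url = "https://crm.example.com/api/v2/accounts/99"
# Response in the shape the signature was built for.
ok = extract(sig, url, {"data": {"client": {"id": "acct_99", "email": "g@example.com"}}})
# Same endpoint after the portal renamed "client" to "customer".
drifted = extract(sig, url, {"data": {"customer": {"id": "acct_99", "email": "g@example.com"}}})
```

In this sketch a non-empty `missing` list plays the role of the failure signal: it is the point at which a real system would emit something like ExtractionFailed and hand over to the Healer.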

The three agents

                   ┌──────────────────────────────────────────────┐
                   │  Browser session → CDP forwarder → events    │
                   └──────────────────────┬───────────────────────┘
                                          │
   ┌──────────────────────┐               │              ┌───────────────────────┐
   │     Architect        │ ◀── once ──── │ ──── live ──▶│     Eavesdropper      │
   │  (LLM classifies     │               │              │  (Pydantic only,      │
   │   endpoints, builds  │               │              │   sub-ms hot path)    │
   │   SiteSchema +       │               │              │                       │
   │   signatures)        │               │              │ emits ExtractionResult│
   └──────────────────────┘               │              │   or ExtractionFailed │
              │                           │              └───────────┬───────────┘
              ▼                           │                          │
   ╔══════════════════════╗               │              ┌───────────▼───────────┐
   ║   MappedSite +       ║◀──── heals ───┼──────────────│       Healer          │
   ║   NetworkSignatures  ║               │              │  (LLM re-maps stale   │
   ╚══════════════════════╝               │              │   keys, auto-applies  │
                                          │              │   confident patches)  │
                                          │              └───────────────────────┘

Architect — runs once. Expensive. Produces the schema.
Eavesdropper — runs on every event. Free. Pure validation.
Healer — runs only on failures. Costs nothing when nothing breaks.

Install

pip install site-mapper-agents

For the runnable examples you'll also want a pydantic-ai provider:

pip install 'pydantic-ai[anthropic]'   # or [openai], [ollama], ...

Quickstart

import asyncio
from pydantic_ai.models.test import TestModel

from site_mapper_agents import (
    Architect,
    CDPNetworkEvent,
    Eavesdropper,
    TargetField,
    UserIntent,
)

# 1. Tell the system what you want to extract.
intent = UserIntent(
    description="Customer account details",
    target_fields=[
        TargetField(name="account_id", description="Account UUID"),
        TargetField(name="email", description="Primary contact email"),
    ],
)

# 2. Construct the Architect. Replace TestModel with a real provider.
architect = Architect(model=TestModel())  # or AnthropicModel("claude-sonnet-4-5")

# 3. Feed it a burst of CDP traffic (your forwarder produced these).
architect.record_traffic(CDPNetworkEvent(
    request_id="r1",
    url="https://crm.example.com/api/v2/accounts/42",
    method="GET",
    body={"data": {"client": {"id": "acct_42", "email": "ada@example.com"}}},
))

# 4. Ask the Architect to propose a schema.
async def onboard():
    proposal = await architect.propose(
        target_url="https://crm.example.com/accounts",
        user_intent=intent,
    )
    site = architect.build_mapped_site(
        proposal=proposal,
        target_url="https://crm.example.com/accounts",
        user_intent=intent,
    )
    return site

site = asyncio.run(onboard())

# 5. From now on, every live CDP event runs through the Eavesdropper.
eaves = Eavesdropper()
result, event = eaves.ingest(
    CDPNetworkEvent(
        request_id="r2",
        url="https://crm.example.com/api/v2/accounts/99",
        method="GET",
        body={"data": {"client": {"id": "acct_99", "email": "g@example.com"}}},
    ),
    sites=[site],
)
print(result.data_payload if result else "no match")

API reference

`Architect(model=None, vocabulary=None, policy=DEFAULT_ONBOARDING_POLICY, model_settings=None)`

The onboarding agent. LLM-once.

| Parameter | Type | Notes |
| --- | --- | --- |
| `model` | `pydantic_ai.Model \| None` | Any pydantic-ai model. `None` → heuristic. |
| `vocabulary` | `list[EndpointType] \| None` | Caller-supplied classifications. See below. |
| `policy` | `OnboardingPolicy` | Sample-count thresholds. |
| `model_settings` | `ModelSettings \| None` | max_tokens, temperature, etc. |

Methods:

record_traffic(event) — buffer a CDP event during onboarding.
record_click() — mark that the user clicked something.
has_enough_samples() → bool — policy check.
detect_endpoints() → list[DetectedEndpoint] — deterministic pre-processing.
await propose(*, target_url, user_intent, llm_classify=None) → ArchitectProposal — the main entry point.
build_mapped_site(*, proposal, target_url, user_intent) → MappedSite — promote an approved proposal to an active site.
emit_event(site, *, success=True, reason="") → SiteMapped | OnboardingFailed.
reset() — clear buffers for the next onboarding session.
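
The source describes detect_endpoints() only as deterministic pre-processing that yields one DetectedEndpoint per unique endpoint. One plausible way to do that kind of deduplication, shown here purely as an assumption, is to collapse ID-like path segments into a placeholder and group by method plus templated URL. `template_path` and `group_endpoints` are hypothetical helpers, not library functions.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Segments that look like record IDs: all digits, or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def template_path(url: str) -> str:
    """Replace ID-like path segments with '{id}' and drop the query string."""
    parts = urlsplit(url)
    segments = ["{id}" if _ID_SEGMENT.match(s) else s for s in parts.path.split("/")]
    return f"{parts.scheme}://{parts.netloc}{'/'.join(segments)}"


def group_endpoints(events):
    """Group (method, url) pairs by (method, templated URL), counting samples."""
    counts = defaultdict(int)
    for method, url in events:
        counts[(method.upper(), template_path(url))] += 1
    return dict(counts)


traffic = [
    ("GET", "https://crm.example.com/api/v2/accounts/42"),
    ("GET", "https://crm.example.com/api/v2/accounts/99?expand=owner"),
    ("get", "https://crm.example.com/api/v2/accounts"),
]
endpoints = group_endpoints(traffic)
```

Here the two detail requests collapse into one templated endpoint with two samples, while the list request stays separate. Grouping like this before the LLM call keeps the prompt to one entry per endpoint rather than one per captured request.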

`Eavesdropper(policy=DEFAULT_EXTRACTION_POLICY)`

The runtime agent. No LLM. Pure Pydantic validation.

Methods:

ingest(event, sites) → (ExtractionResult | None, ExtractionSucceeded | ExtractionFailed | None).

`Healer(model=None, policy=DEFAULT_HEALING_POLICY, model_settings=None)`

The self-healing agent.

Methods:

await diagnose(*, site, failed_event, new_response_body=None, llm_semantic_match=None) → HealerPatch.
apply_patch(site, patch) → (bool, SchemaHealed | HealingFailed | SiteDegraded).

Models

| Class | Purpose |
| --- | --- |
| `CDPNetworkEvent` | One captured network response. Library input. |
| `TargetField` | One data point the caller wants extracted. |
| `UserIntent` | A bundle of target fields with a human description. |
| `EndpointType` | One entry in the Architect's classification vocabulary. |
| `DetectedEndpoint` | Pre-LLM view of a unique endpoint. |
| `NetworkSignature` | URL pattern + JSON-key map. Saved per site. |
| `SiteSchema` | The extraction contract for one intent. |
| `ArchitectProposal` | Architect's structured output before user confirms. |
| `HealerPatch` | Healer's structured output for one repair attempt. |
| `MappedSite` | Aggregate root: schemas + signatures + status. |
| `ExtractionResult` | Eavesdropper's output for one matched event. |

Domain events

SiteMapped, OnboardingFailed, ExtractionSucceeded, ExtractionFailed, SchemaHealed, HealingFailed, SiteDegraded.

All extend AutomationEvent (frozen Pydantic model).

Endpoint vocabularies

The Architect's LLM prompt embeds a list of EndpointType definitions that tell the model "you may only classify endpoints into one of these categories". The default vocabulary covers generic CRUD shapes:

| name | what it means |
| --- | --- |
| `list_records` | Paginated list of records (grid/table views). |
| `detail_view` | One record's full detail (after click-through). |
| `search` | Filtered records based on user query. |
| `create_record` | POST/PUT that creates a new record. |
| `update_record` | PATCH/PUT that mutates an existing record. |
| `delete_record` | DELETE. |
| `reference_data` | Lookup / enum / config data. |
| `metrics` | Dashboard counts/aggregates. |
| `unknown` | Fallback when nothing fits. |

You'll usually want to extend this with site-specific categories:

from site_mapper_agents import (
    Architect,
    default_vocabulary,
    define_endpoint_type,
    merge_vocabularies,
)

vocab = merge_vocabularies(
    default_vocabulary(),
    [
        define_endpoint_type(
            name="invoice_pdf_download",
            description="Streaming download of a generated invoice PDF",
            expected_fields=["invoice_id", "pdf_url"],
        ),
        define_endpoint_type(
            name="webhook_subscription",
            description="Webhook registration endpoint that returns the subscription id",
            expected_fields=["subscription_id", "target_url", "events"],
        ),
    ],
)

architect = Architect(model=my_model, vocabulary=vocab)

LLM providers

The library binds to any provider pydantic-ai supports — just pass a Model instance (or its name) to the agent constructor:

# Anthropic
from pydantic_ai.models.anthropic import AnthropicModel
architect = Architect(model=AnthropicModel("claude-sonnet-4-5"))

# OpenAI
from pydantic_ai.models.openai import OpenAIModel
architect = Architect(model=OpenAIModel("gpt-4o"))

# Ollama (or any OpenAI-compatible local server)
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider
architect = Architect(model=OpenAIModel(
    "llama3.1:8b",
    provider=OpenAIProvider(base_url="http://localhost:11434/v1"),
))

# Deterministic stub for tests
from pydantic_ai.models.test import TestModel
architect = Architect(model=TestModel())

CDP burst format

CDPNetworkEvent is the only input shape the library cares about:

CDPNetworkEvent(
    request_id="<unique-id>",
    url="https://...",
    method="GET",
    status_code=200,
    headers={"content-type": "application/json"},
    body={"data": {"...": "..."}},   # parsed JSON
    frame_origin=None,                # set for iframe traffic
    target_id=None,                   # CDP target id, for multi-frame disambiguation
    timestamp=1715760000.0,
)

The library does not capture CDP traffic itself. Use a sibling tool — e.g. axumquant/cdp-network-interceptor — or your own Chrome extension / Puppeteer / Playwright session that emits this shape.
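
If you write your own forwarder, the fields above have to be assembled from several raw CDP payloads. The sketch below shows one way to do that, assuming you already have the `params` of `Network.requestWillBeSent` and `Network.responseReceived` plus the result of `Network.getResponseBody` for the same request. `to_event_kwargs` is a hypothetical helper, and lowercasing header names is an assumption made only to mirror the example shape above. `frame_origin` and `target_id` are left out because how you derive them depends on your forwarder.

```python
import base64
import json
from typing import Any, Optional


def to_event_kwargs(
    request_will_be_sent: dict,
    response_received: dict,
    response_body: dict,
    captured_at: float,
) -> Optional[dict]:
    """Build CDPNetworkEvent-style keyword arguments from raw CDP payloads.

    Returns None when the response body is not parseable JSON.
    """
    response = response_received["response"]
    raw = response_body["body"]
    # Network.getResponseBody flags binary-safe bodies with base64Encoded.
    if response_body.get("base64Encoded"):
        raw = base64.b64decode(raw).decode("utf-8", errors="replace")
    try:
        parsed: Any = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return {
        "request_id": response_received["requestId"],
        "url": response["url"],
        # The HTTP method lives on the request payload, not the response.
        "method": request_will_be_sent["request"]["method"],
        "status_code": response["status"],
        "headers": {k.lower(): v for k, v in response["headers"].items()},
        "body": parsed,
        # CDP timestamps are monotonic, so the wall-clock time is passed in.
        "timestamp": captured_at,
    }


url = "https://crm.example.com/api/v2/accounts/42"
rws = {"requestId": "r1", "request": {"url": url, "method": "GET"}}
rr = {
    "requestId": "r1",
    "response": {"url": url, "status": 200, "headers": {"Content-Type": "application/json"}},
}
body = {"body": '{"data": {"client": {"id": "acct_42"}}}', "base64Encoded": False}
kwargs = to_event_kwargs(rws, rr, body, captured_at=1715760000.0)
```

The resulting dict could then be passed as `CDPNetworkEvent(**kwargs)`, assuming the constructor accepts the keyword arguments shown in the shape above.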

Self-healing flow

When does the Healer fire?

1. The Eavesdropper validates an incoming event and detects missing fields against a registered signature.
2. It emits ExtractionFailed and returns it from ingest().
3. Your orchestrator passes the failed event (plus the raw response body) to Healer.diagnose().
4. The Healer runs structural matching first (same key still exists? then we just need a path tweak). If everything resolves structurally, no LLM call happens.
5. Otherwise the Healer calls its pydantic-ai Agent with the old key map + new available keys + unresolved field names.
6. The returned HealerPatch has an aggregate confidence:
   - ≥ auto_approve_above (default 0.90) → apply_patch() succeeds, emits SchemaHealed, signature is replaced in-place.
   - [min_semantic_confidence, require_human_review_below) (default 0.70–0.75) → apply_patch() returns HealingFailed with reason requires human review. Surface this to the user.
   - < min_semantic_confidence → site is marked DEGRADED, retried up to max_attempts times, then marked BROKEN.
Persistence is the caller's job — the library mutates the MappedSite aggregate in memory but doesn't write it anywhere.
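
The two deterministic parts of the flow above, structural matching and confidence routing, can be sketched in plain Python. This is an illustration under stated assumptions, not the Healer's actual code: `leaf_paths`, `structural_rematch`, and `route_patch` are hypothetical names, key maps are assumed to be dotted paths, and "same key still exists" is interpreted as "exactly one path in the new body ends in the same leaf key". The source does not say what happens for confidences between require_human_review_below and auto_approve_above, so the sketch reports that band as unspecified rather than guessing.

```python
from typing import Any, Iterator


def leaf_paths(body: Any, prefix: str = "") -> Iterator[str]:
    """Yield dotted paths to every non-dict value in a nested dict."""
    if isinstance(body, dict):
        for key, value in body.items():
            yield from leaf_paths(value, f"{prefix}.{key}" if prefix else key)
    elif prefix:
        yield prefix


def structural_rematch(old_key_map: dict, new_body: Any):
    """Re-point each field at the unique new path ending in the same leaf key.

    Returns (resolved, unresolved); ambiguous or vanished keys stay unresolved
    and would be the ones handed to the LLM for semantic matching.
    """
    paths = list(leaf_paths(new_body))
    resolved, unresolved = {}, []
    for field_name, old_path in old_key_map.items():
        leaf = old_path.rsplit(".", 1)[-1]
        candidates = [p for p in paths if p.rsplit(".", 1)[-1] == leaf]
        if len(candidates) == 1:
            resolved[field_name] = candidates[0]
        else:
            unresolved.append(field_name)
    return resolved, unresolved


def route_patch(confidence, auto_approve_above=0.90,
                require_human_review_below=0.75, min_semantic_confidence=0.70):
    """Map an aggregate confidence to one of the outcomes listed above."""
    if confidence >= auto_approve_above:
        return "auto_apply"
    if confidence < min_semantic_confidence:
        return "degraded"
    if confidence < require_human_review_below:
        return "human_review"
    # The text does not describe this band, so the sketch does not guess.
    return "unspecified"


old_map = {"account_id": "data.client.id", "email": "data.client.email"}
new_body = {"data": {"customer": {"id": "acct_99", "contact_email": "g@example.com"}}}
resolved, unresolved = structural_rematch(old_map, new_body)
```

In this example `account_id` is repaired structurally because `id` still exists under a new parent, while `email` was renamed to `contact_email` and would need the semantic (LLM) pass.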

Use cases

Salesforce custom-object extraction — Salesforce's API surface is huge and per-tenant. Onboard once against the tenant you have a session on, extract from then on.
HubSpot scraping — undocumented internal endpoints powering the UI.
Internal CRM discovery — your customer is on some no-name CRM you've never seen. Onboarding takes minutes.
Pre-acquisition portal audits — point it at a target's admin portal, get back a structured map of their data surface.
Partner integrations with companies who refuse to ship an API.

Pitfalls

The Architect costs money — it's an LLM call with a non-trivial prompt + context. Budget for one call per site you map. The Eavesdropper is free; the Healer only fires when something breaks.
Schema drift is real — sites change shapes monthly. Wire the Healer or you'll be debugging in production.
Auth-protected endpoints — the library never authenticates for you. You drive a real browser session; the CDP forwarder captures authenticated traffic. The library only sees the resulting bodies.
Rate limits — your scraping cadence is your problem. Polite pacing is on you.
Iframe traffic — the library handles frame_origin matching correctly, but your CDP forwarder MUST populate it. Without frame_origin, iframe responses match parent-frame signatures, which produces garbage extractions.
The vocabulary matters — generic CRUD works for most sites, but niche portals benefit a lot from a custom vocabulary that names the domain entities (e.g. invoice_line_items vs generic list_records).
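
The iframe pitfall above is easier to see with a toy matcher. The sketch below assumes, hypothetically, that a signature stores the frame origin it was learned from and that `None` means the top-level frame. `Sig` and `matches` are illustrative names, not the library's types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sig:
    url_prefix: str
    frame_origin: Optional[str] = None  # None = top-level frame (assumption)


def matches(sig: Sig, url: str, frame_origin: Optional[str]) -> bool:
    """A signature matches only when both the URL and the frame origin agree."""
    return url.startswith(sig.url_prefix) and sig.frame_origin == frame_origin


parent_sig = Sig("https://crm.example.com/api/items")
iframe_url = "https://crm.example.com/api/items?widget=1"

# Forwarder populates frame_origin: the iframe response is kept apart.
populated = matches(parent_sig, iframe_url, "https://widgets.example.net")
# Forwarder omits frame_origin: the iframe response looks like parent traffic.
omitted = matches(parent_sig, iframe_url, None)
```

With the origin populated the iframe response is rejected by the parent-frame signature. With it omitted, the same response is accepted, which is the garbage-extraction case the pitfall warns about.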

License

MIT — see LICENSE.
