LLM-once API discovery + self-healing extraction for any browser-accessible portal.
Burst-record CDP network traffic from a portal you have a browser session on, hand it to a three-agent team, get back a typed schema + signatures you can extract from forever — with auto-repair when the portal's API shape drifts.
Every SaaS portal has a different API. Writing extractors for each is a treadmill — and the schemas change without warning, so your extractors silently break.
Pre-built connectors only cover the top 20 platforms. For everything else (internal CRMs, niche-vertical tools, undocumented partner portals) you either pay someone to reverse-engineer the API, or you give up and scrape the DOM.
This library is the third option.
- Onboarding: you point the system at a portal you have a real browser session on. It records a burst of CDP network traffic while you click around, then asks an LLM once to classify which endpoints carry the data you want and to map response JSON keys to your fields. Output is a typed `SiteSchema` and a list of `NetworkSignature` patterns.
- Extraction: from that point forward, every CDP event is matched against the saved signatures with pure Pydantic validation: sub-millisecond, no LLM calls, no cost.
- Self-healing: when the portal changes its response shape, an `ExtractionFailed` event fires. The Healer compares the old key map against the new response, fixes what it can deterministically, and asks the LLM to semantically match the rest. Confident patches auto-apply. Borderline patches surface for human review.
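The deterministic part of self-healing can be pictured as a search for the old leaf key at a new location in the response. The sketch below illustrates that idea under stated assumptions and is not the library's implementation: `find_key_paths` and `repath` are hypothetical helpers, and key maps are assumed to be dotted paths such as `data.client.email`.

```python
from __future__ import annotations

from typing import Any


def find_key_paths(body: Any, leaf: str, prefix: str = "") -> list[str]:
    """Return every dotted path in a nested dict whose last segment is `leaf`."""
    paths: list[str] = []
    if isinstance(body, dict):
        for key, value in body.items():
            path = f"{prefix}.{key}" if prefix else key
            if key == leaf:
                paths.append(path)
            paths.extend(find_key_paths(value, leaf, path))
    return paths


def repath(old_key_map: dict[str, str], new_body: dict) -> tuple[dict[str, str], list[str]]:
    """Re-point each field whose old leaf key still exists exactly once.

    Returns (patched_key_map, unresolved_fields). Unresolved fields are the
    ones that would be left for a semantic (LLM) match.
    """
    patched: dict[str, str] = {}
    unresolved: list[str] = []
    for field, old_path in old_key_map.items():
        leaf = old_path.rsplit(".", 1)[-1]
        candidates = find_key_paths(new_body, leaf)
        if len(candidates) == 1:
            patched[field] = candidates[0]
        else:
            unresolved.append(field)
    return patched, unresolved


old_map = {"account_id": "data.client.id", "email": "data.client.email"}
new_body = {"result": {"account": {"id": "acct_42", "contact_email": "ada@example.com"}}}
patched, unresolved = repath(old_map, new_body)
print(patched)     # {'account_id': 'result.account.id'}
print(unresolved)  # ['email']
```

Here `id` merely moved, so it resolves without a model call, while `email` was renamed to `contact_email` and stays unresolved. Ambiguous leaves (more than one candidate path) are also left unresolved rather than guessed.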
```
                ┌──────────────────────────────────────────────┐
                │   Browser session → CDP forwarder → events   │
                └──────────────────────┬───────────────────────┘
                                       │
┌──────────────────────┐               │              ┌───────────────────────┐
│ Architect            │ ◀── once ──── │ ──── live ──▶│ Eavesdropper          │
│ (LLM classifies      │               │              │ (Pydantic only,       │
│ endpoints, builds    │               │              │ sub-ms hot path)      │
│ SiteSchema +         │               │              │                       │
│ signatures)          │               │              │ emits ExtractionResult│
└──────────────────────┘               │              │ or ExtractionFailed   │
           │                           │              └───────────┬───────────┘
           ▼                           │                          │
╔══════════════════════╗               │              ┌───────────▼───────────┐
║ MappedSite +         ║◀──── heals ───┼──────────────│ Healer                │
║ NetworkSignatures    ║               │              │ (LLM re-maps stale    │
╚══════════════════════╝               │              │ keys, auto-applies    │
                                       │              │ confident patches)    │
                                       │              └───────────────────────┘
```
- Architect — runs once. Expensive. Produces the schema.
- Eavesdropper — runs on every event. Free. Pure validation.
- Healer — runs only on failures. Costs nothing when nothing breaks.
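The cost profile of the three agents follows from when each one runs. A toy dispatch loop makes the point; it uses stand-in functions rather than the real agents, every name in it is hypothetical, and the "LLM call" is just a counter.

```python
from __future__ import annotations

llm_calls = 0


def architect_onboard(sample: dict) -> dict[str, str]:
    """Stand-in for the one-off onboarding step: returns a field -> key map."""
    global llm_calls
    llm_calls += 1
    return {"email": "email"}


def eavesdrop(key_map: dict[str, str], body: dict) -> dict | None:
    """Stand-in hot path: plain dict lookups, no LLM. None signals a failure."""
    try:
        return {field: body[key] for field, key in key_map.items()}
    except KeyError:
        return None


def heal(key_map: dict[str, str], body: dict) -> dict[str, str]:
    """Stand-in repair step: counted as one LLM call, only reached on failure."""
    global llm_calls
    llm_calls += 1
    return {"email": "contact_email"}


# Three events in the old shape, then the portal renames the key.
events = [{"email": "a@example.com"}] * 3 + [{"contact_email": "b@example.com"}] * 2
key_map = architect_onboard(events[0])
extracted = []
for body in events:
    row = eavesdrop(key_map, body)
    if row is None:  # shape drifted: repair once, then retry
        key_map = heal(key_map, body)
        row = eavesdrop(key_map, body)
    extracted.append(row)

print(len(extracted), llm_calls)  # 5 2
```

Five events are extracted with two model calls in total: one for onboarding and one for the single drift. The second drifted event reuses the repaired key map.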
```bash
pip install site-mapper-agents
```

For the runnable examples you'll also want a pydantic-ai provider:

```bash
pip install 'pydantic-ai[anthropic]'  # or [openai], [ollama], ...
```

```python
import asyncio

from pydantic_ai.models.test import TestModel
from site_mapper_agents import (
    Architect,
    CDPNetworkEvent,
    Eavesdropper,
    TargetField,
    UserIntent,
)

# 1. Tell the system what you want to extract.
intent = UserIntent(
    description="Customer account details",
    target_fields=[
        TargetField(name="account_id", description="Account UUID"),
        TargetField(name="email", description="Primary contact email"),
    ],
)

# 2. Construct the Architect. Replace TestModel with a real provider.
architect = Architect(model=TestModel())  # or AnthropicModel("claude-sonnet-4-5")

# 3. Feed it a burst of CDP traffic (your forwarder produced these).
architect.record_traffic(CDPNetworkEvent(
    request_id="r1",
    url="https://crm.example.com/api/v2/accounts/42",
    method="GET",
    body={"data": {"client": {"id": "acct_42", "email": "ada@example.com"}}},
))


# 4. Ask the Architect to propose a schema.
async def onboard():
    proposal = await architect.propose(
        target_url="https://crm.example.com/accounts",
        user_intent=intent,
    )
    site = architect.build_mapped_site(
        proposal=proposal,
        target_url="https://crm.example.com/accounts",
        user_intent=intent,
    )
    return site


site = asyncio.run(onboard())

# 5. From now on, every live CDP event runs through the Eavesdropper.
eaves = Eavesdropper()
result, event = eaves.ingest(
    CDPNetworkEvent(
        request_id="r2",
        url="https://crm.example.com/api/v2/accounts/99",
        method="GET",
        body={"data": {"client": {"id": "acct_99", "email": "g@example.com"}}},
    ),
    sites=[site],
)
print(result.data_payload if result else "no match")
```

Architect: the onboarding agent. LLM-once.
| Parameter | Type | Notes |
|---|---|---|
| `model` | `pydantic_ai.Model \| None` | Any pydantic-ai model. `None` → heuristic. |
| `vocabulary` | `list[EndpointType] \| None` | Caller-supplied classifications. See below. |
| `policy` | `OnboardingPolicy` | Sample-count thresholds. |
| `model_settings` | `ModelSettings \| None` | `max_tokens`, `temperature`, etc. |
Methods:
- `record_traffic(event)`: buffer a CDP event during onboarding.
- `record_click()`: mark that the user clicked something.
- `has_enough_samples()` → `bool`: policy check.
- `detect_endpoints()` → `list[DetectedEndpoint]`: deterministic pre-processing.
- `await propose(*, target_url, user_intent, llm_classify=None)` → `ArchitectProposal`: the main entry point.
- `build_mapped_site(*, proposal, target_url, user_intent)` → `MappedSite`: promote an approved proposal to an active site.
- `emit_event(site, *, success=True, reason="")` → `SiteMapped | OnboardingFailed`.
- `reset()`: clear buffers for the next onboarding session.
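`detect_endpoints()` is described as deterministic pre-processing that yields one entry per unique endpoint. One plausible way to get there, shown purely as an assumption about the approach, is to collapse ID-like path segments into a placeholder and group samples by method plus templated path. `template_path` and `group_endpoints` are hypothetical names, and the ID heuristics are illustrative.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Segments that look like record identifiers: integers, UUIDs, or prefixed ids such as acct_42.
ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}|[a-z]+_\d+)$",
    re.IGNORECASE,
)


def template_path(url: str) -> str:
    """Replace ID-like path segments with `{id}` and drop the query string."""
    segments = urlsplit(url).path.split("/")
    return "/".join("{id}" if ID_SEGMENT.match(s) else s for s in segments)


def group_endpoints(events: list) -> Counter:
    """Count sampled (method, url) pairs per (method, templated path)."""
    return Counter((method, template_path(url)) for method, url in events)


traffic = [
    ("GET", "https://crm.example.com/api/v2/accounts/42"),
    ("GET", "https://crm.example.com/api/v2/accounts/99?expand=owner"),
    ("GET", "https://crm.example.com/api/v2/accounts"),
]
grouped = group_endpoints(traffic)
for (method, path), samples in grouped.items():
    print(method, path, samples)
```

The two detail requests collapse into one `/api/v2/accounts/{id}` entry with two samples, which is the kind of per-endpoint sample count a policy check like `has_enough_samples()` could consult.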
Eavesdropper: the runtime agent. No LLM. Pure Pydantic validation.
Methods:
- `ingest(event, sites)` → `(ExtractionResult | None, ExtractionSucceeded | ExtractionFailed | None)`.
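To make the hot path concrete, here is a minimal sketch of signature matching with no LLM involved. It uses plain dataclasses and dict lookups instead of Pydantic so that it runs on the standard library alone. `Signature`, `resolve`, and `match_event` are hypothetical names, and storing the key map as dotted paths is an assumption.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """Hypothetical stand-in for a saved signature: URL regex + field -> dotted path."""
    url_pattern: str
    key_map: dict


def resolve(body: dict, dotted_path: str):
    """Walk a dotted path through nested dicts; raise KeyError if any step is missing."""
    node = body
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(dotted_path)
        node = node[part]
    return node


def match_event(url: str, body: dict, signatures: list[Signature]):
    """Return (payload, missing_fields) for the first matching signature, else (None, None)."""
    for sig in signatures:
        if not re.fullmatch(sig.url_pattern, url):
            continue
        payload, missing = {}, []
        for field, path in sig.key_map.items():
            try:
                payload[field] = resolve(body, path)
            except KeyError:
                missing.append(field)
        return payload, missing
    return None, None


sig = Signature(
    url_pattern=r"https://crm\.example\.com/api/v2/accounts/\d+",
    key_map={"account_id": "data.client.id", "email": "data.client.email"},
)
url = "https://crm.example.com/api/v2/accounts/99"
ok = match_event(url, {"data": {"client": {"id": "acct_99", "email": "g@example.com"}}}, [sig])
drifted = match_event(url, {"data": {"client": {"id": "acct_99"}}}, [sig])
print(ok)       # ({'account_id': 'acct_99', 'email': 'g@example.com'}, [])
print(drifted)  # ({'account_id': 'acct_99'}, ['email'])
```

A non-empty list of missing fields is the condition that would correspond to an `ExtractionFailed` outcome. An event whose URL matches no signature is simply ignored.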
Healer: the self-healing agent.
Methods:
- `await diagnose(*, site, failed_event, new_response_body=None, llm_semantic_match=None)` → `HealerPatch`.
- `apply_patch(site, patch)` → `(bool, SchemaHealed | HealingFailed | SiteDegraded)`.
| Class | Purpose |
|---|---|
| `CDPNetworkEvent` | One captured network response. Library input. |
| `TargetField` | One data point the caller wants extracted. |
| `UserIntent` | A bundle of target fields with a human description. |
| `EndpointType` | One entry in the Architect's classification vocabulary. |
| `DetectedEndpoint` | Pre-LLM view of a unique endpoint. |
| `NetworkSignature` | URL pattern + JSON-key map. Saved per site. |
| `SiteSchema` | The extraction contract for one intent. |
| `ArchitectProposal` | Architect's structured output before user confirms. |
| `HealerPatch` | Healer's structured output for one repair attempt. |
| `MappedSite` | Aggregate root: schemas + signatures + status. |
| `ExtractionResult` | Eavesdropper's output for one matched event. |
Events: `SiteMapped`, `OnboardingFailed`, `ExtractionSucceeded`, `ExtractionFailed`, `SchemaHealed`, `HealingFailed`, `SiteDegraded`.
All extend `AutomationEvent` (frozen Pydantic model).
The Architect's LLM prompt embeds a list of EndpointType definitions
that tell the model "you may only classify endpoints into one of these
categories". The default vocabulary covers generic CRUD shapes:
| name | what it means |
|---|---|
| `list_records` | Paginated list of records (grid/table views). |
| `detail_view` | One record's full detail (after click-through). |
| `search` | Filtered records based on user query. |
| `create_record` | POST/PUT that creates a new record. |
| `update_record` | PATCH/PUT that mutates an existing record. |
| `delete_record` | DELETE. |
| `reference_data` | Lookup / enum / config data. |
| `metrics` | Dashboard counts/aggregates. |
| `unknown` | Fallback when nothing fits. |
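A closed vocabulary constrains the model in two places: the prompt lists the allowed categories, and anything the model returns outside that list can be coerced to the fallback. The sketch below shows both steps with hypothetical stand-ins (`EndpointKind`, `merge`, `render_prompt_section`, `validate_label`). The rule that a later entry overrides an earlier one with the same name is an assumption, not documented behaviour.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointKind:
    """Hypothetical stand-in for one vocabulary entry."""
    name: str
    description: str


def merge(*vocabularies: list[EndpointKind]) -> list[EndpointKind]:
    """Concatenate vocabularies; a later entry replaces an earlier one with the same name."""
    by_name: dict[str, EndpointKind] = {}
    for vocab in vocabularies:
        for kind in vocab:
            by_name[kind.name] = kind
    return list(by_name.values())


def render_prompt_section(vocab: list[EndpointKind]) -> str:
    """Render the closed list of categories a model would be told to choose from."""
    lines = ["Classify each endpoint as exactly one of:"]
    lines += [f"- {k.name}: {k.description}" for k in vocab]
    return "\n".join(lines)


def validate_label(label: str, vocab: list[EndpointKind]) -> str:
    """Coerce any out-of-vocabulary answer to the `unknown` fallback."""
    return label if label in {k.name for k in vocab} else "unknown"


defaults = [
    EndpointKind("list_records", "Paginated list of records"),
    EndpointKind("unknown", "Fallback when nothing fits"),
]
custom = [EndpointKind("invoice_pdf_download", "Streaming download of a generated invoice PDF")]
vocab = merge(defaults, custom)
print(render_prompt_section(vocab))
print(validate_label("billing_export", vocab))  # unknown
```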
You'll usually want to extend this with site-specific categories:
```python
from site_mapper_agents import (
    Architect,
    default_vocabulary,
    define_endpoint_type,
    merge_vocabularies,
)

vocab = merge_vocabularies(
    default_vocabulary(),
    [
        define_endpoint_type(
            name="invoice_pdf_download",
            description="Streaming download of a generated invoice PDF",
            expected_fields=["invoice_id", "pdf_url"],
        ),
        define_endpoint_type(
            name="webhook_subscription",
            description="Webhook registration endpoint that returns the subscription id",
            expected_fields=["subscription_id", "target_url", "events"],
        ),
    ],
)

# my_model is a placeholder for any pydantic-ai model instance.
architect = Architect(model=my_model, vocabulary=vocab)
```

The library binds to any provider pydantic-ai supports: just pass a `Model` instance (or its name) to the agent constructor:
```python
from site_mapper_agents import Architect

# Anthropic
from pydantic_ai.models.anthropic import AnthropicModel
architect = Architect(model=AnthropicModel("claude-sonnet-4-5"))

# OpenAI
from pydantic_ai.models.openai import OpenAIModel
architect = Architect(model=OpenAIModel("gpt-4o"))

# Ollama (or any OpenAI-compatible local server)
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider
architect = Architect(model=OpenAIModel(
    "llama3.1:8b",
    provider=OpenAIProvider(base_url="http://localhost:11434/v1"),
))

# Deterministic stub for tests
from pydantic_ai.models.test import TestModel
architect = Architect(model=TestModel())
```

`CDPNetworkEvent` is the only input shape the library cares about:
```python
CDPNetworkEvent(
    request_id="<unique-id>",
    url="https://...",
    method="GET",
    status_code=200,
    headers={"content-type": "application/json"},
    body={"data": {"...": "..."}},  # parsed JSON
    frame_origin=None,              # set for iframe traffic
    target_id=None,                 # CDP target id, for multi-frame disambiguation
    timestamp=1715760000.0,
)
```

The library does not capture CDP traffic itself. Use a sibling tool (e.g. axumquant/cdp-network-interceptor) or your own Chrome extension / Puppeteer / Playwright session that emits this shape.
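If you write your own forwarder, most of the work is stitching three raw CDP payloads into that shape. The function below is a sketch of one way to do it and returns a plain dict of keyword arguments rather than constructing the library class. `to_event_fields` is a hypothetical helper. `frame_origin` and `target_id` are left out because deriving them depends on how your forwarder tracks frames and targets.

```python
from __future__ import annotations

import base64
import json


def to_event_fields(request: dict, response: dict, body_result: dict) -> dict | None:
    """Combine raw CDP payloads into keyword arguments for one network event.

    request     -- params of a Network.requestWillBeSent event
    response    -- params of the matching Network.responseReceived event
    body_result -- result of Network.getResponseBody for the same requestId
    Returns None when the body is not valid JSON.
    """
    raw = body_result["body"]
    if body_result.get("base64Encoded"):
        raw = base64.b64decode(raw).decode("utf-8", errors="replace")
    try:
        body = json.loads(raw)
    except json.JSONDecodeError:
        return None
    resp = response["response"]
    return {
        "request_id": response["requestId"],
        "url": resp["url"],
        "method": request["request"]["method"],  # the method lives on the request event
        "status_code": resp["status"],
        "headers": {k.lower(): v for k, v in resp["headers"].items()},
        "body": body,
        "timestamp": request["wallTime"],  # seconds since epoch; `timestamp` is monotonic
    }


request = {"requestId": "r1", "wallTime": 1715760000.0,
           "request": {"method": "GET", "url": "https://crm.example.com/api/v2/accounts/42"}}
response = {"requestId": "r1",
            "response": {"url": "https://crm.example.com/api/v2/accounts/42", "status": 200,
                         "headers": {"Content-Type": "application/json"}}}
body_result = {"body": '{"data": {"client": {"id": "acct_42"}}}', "base64Encoded": False}

fields = to_event_fields(request, response, body_result)
print(fields["method"], fields["status_code"], fields["body"]["data"]["client"]["id"])
```

Non-JSON responses (HTML, images, fonts) return `None` here, so the forwarder can drop them before they reach the library.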
When does the Healer fire?
- The Eavesdropper validates an incoming event and detects missing fields against a registered signature.
- It emits `ExtractionFailed` and returns it from `ingest()`.
- Your orchestrator passes the failed event (plus the raw response body) to `Healer.diagnose()`.
- The Healer runs structural matching first (does the same key still exist? Then only the path needs a tweak). If everything resolves structurally, no LLM call happens.
- Otherwise the Healer calls its pydantic-ai Agent with the old key map + new available keys + unresolved field names.
- The returned `HealerPatch` has an aggregate confidence:
  - `≥ auto_approve_above` (default 0.90) → `apply_patch()` succeeds, emits `SchemaHealed`, signature is replaced in-place.
  - `[min_semantic_confidence, require_human_review_below)` (default 0.70–0.75) → `apply_patch()` returns `HealingFailed` with reason `requires human review`. Surface this to the user.
  - `< min_semantic_confidence` → site is marked DEGRADED, retried up to `max_attempts` times, then marked BROKEN.
- Persistence is the caller's job: the library mutates the `MappedSite` aggregate in memory but doesn't write it anywhere.
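The confidence bands above can be read as a small routing function. The sketch below encodes only what the list states. `HealingThresholds` and `route_patch` are hypothetical names, the threshold names and defaults come from the list, and the band between `require_human_review_below` and `auto_approve_above` is returned as `"unspecified"` because the list does not say what happens there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealingThresholds:
    """Hypothetical container; field names and defaults follow the list above."""
    auto_approve_above: float = 0.90
    require_human_review_below: float = 0.75
    min_semantic_confidence: float = 0.70


def route_patch(confidence: float, t: HealingThresholds = HealingThresholds()) -> str:
    """Map an aggregate patch confidence to the outcome described for that band."""
    if confidence >= t.auto_approve_above:
        return "auto_apply"      # SchemaHealed, signature replaced in place
    if confidence < t.min_semantic_confidence:
        return "degraded"        # retried up to max_attempts, then BROKEN
    if confidence < t.require_human_review_below:
        return "human_review"    # HealingFailed: requires human review
    return "unspecified"         # [0.75, 0.90): not covered by the description


for c in (0.95, 0.80, 0.72, 0.50):
    print(c, route_patch(c))
```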
- Salesforce custom-object extraction — Salesforce's API surface is huge and per-tenant. Onboard once against the tenant you have a session on, extract from then on.
- HubSpot scraping — undocumented internal endpoints powering the UI.
- Internal CRM discovery — your customer is on some no-name CRM you've never seen. Onboarding takes minutes.
- Pre-acquisition portal audits — point it at a target's admin portal, get back a structured map of their data surface.
- Partner integrations with companies who refuse to ship an API.
- The Architect costs money — it's an LLM call with a non-trivial prompt + context. Budget for one call per site you map. The Eavesdropper is free; the Healer only fires when something breaks.
- Schema drift is real — sites change shapes monthly. Wire the Healer or you'll be debugging in production.
- Auth-protected endpoints — the library never authenticates for you. You drive a real browser session; the CDP forwarder captures authenticated traffic. The library only sees the resulting bodies.
- Rate limits — your scraping cadence is your problem. Polite pacing is on you.
- Iframe traffic: the library handles `frame_origin` matching correctly, but your CDP forwarder MUST populate it. Without `frame_origin`, iframe responses match parent-frame signatures, which produces garbage extractions.
- The vocabulary matters: generic CRUD works for most sites, but niche portals benefit a lot from a custom vocabulary that names the domain entities (e.g. `invoice_line_items` vs generic `list_records`).
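The iframe gotcha is easiest to see with a toy matcher. The sketch assumes a signature records the frame origin it was learned from, with `None` meaning the top-level frame. `FrameAwareSignature` and `matches` are hypothetical, and the library's actual matching rules may differ.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameAwareSignature:
    """Hypothetical signature that remembers which frame its traffic came from."""
    url_pattern: str
    frame_origin: str | None = None  # None is taken to mean the top-level frame


def matches(sig: FrameAwareSignature, url: str, frame_origin: str | None) -> bool:
    """A signature matches only when both the URL and the frame origin agree."""
    return sig.frame_origin == frame_origin and re.fullmatch(sig.url_pattern, url) is not None


parent_sig = FrameAwareSignature(url_pattern=r"https://api\.example\.com/v1/items")
url = "https://api.example.com/v1/items"

# Forwarder populates frame_origin: the embedded widget's response is kept apart.
print(matches(parent_sig, url, "https://widget.example.net"))  # False

# Forwarder leaves frame_origin unset: the same iframe response now looks top-level.
print(matches(parent_sig, url, None))  # True
```

The second call is the failure mode: the iframe response is indistinguishable from parent-frame traffic, so it is extracted with a key map that was never meant for it.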
MIT — see LICENSE.