axumquant/site-mapper-agents

site-mapper-agents

LLM-once API discovery + self-healing extraction for any browser-accessible portal.

Burst-record CDP network traffic from a portal you have a browser session on, hand it to a three-agent team, get back a typed schema + signatures you can extract from forever — with auto-repair when the portal's API shape drifts.


The problem

Every SaaS portal has a different API. Writing extractors for each is a treadmill — and the schemas change without warning, so your extractors silently break.

Pre-built connectors only cover the top 20 platforms. For everything else (internal CRMs, niche-vertical tools, undocumented partner portals) you either pay someone to reverse-engineer the API, or you give up and scrape the DOM.

This library is the third option.

What this solves

  1. Onboarding: you point the system at a portal you have a real browser session on. It records a burst of CDP network traffic while you click around, then asks an LLM once to classify which endpoints carry the data you want and to map response JSON keys to your fields. Output is a typed SiteSchema and a list of NetworkSignature patterns.
  2. Extraction: from that point forward, every CDP event is matched against the saved signatures with pure Pydantic validation — sub-millisecond, no LLM calls, no cost.
  3. Self-healing: when the portal changes its response shape, an ExtractionFailed event fires. The Healer compares the old key map against the new response, fixes what it can deterministically, and asks the LLM to semantically match the rest. Confident patches auto-apply. Borderline patches surface for human review.
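
The extraction step (item 2) can be pictured with a small stdlib-only sketch. This is not the library's implementation: `Signature`, `url_regex`, `key_map`, `walk` and `extract` are hypothetical names, and the real Eavesdropper validates with Pydantic rather than plain dict walking. It only shows why matching a saved URL pattern and key map needs no LLM call.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """Hypothetical stand-in for a saved signature: URL pattern + key map."""
    url_regex: str
    key_map: dict  # target field name -> dotted path into the response JSON


def walk(body, dotted_path):
    """Follow a dotted path through nested dicts; None if any hop is missing."""
    node = body
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def extract(signature, url, body):
    """Return (payload, missing_fields), or None when the URL does not match."""
    if re.fullmatch(signature.url_regex, url) is None:
        return None
    payload = {name: walk(body, path) for name, path in signature.key_map.items()}
    missing = sorted(name for name, value in payload.items() if value is None)
    return payload, missing


sig = Signature(
    url_regex=r"https://crm\.example\.com/api/v2/accounts/\d+",
    key_map={"account_id": "data.client.id", "email": "data.client.email"},
)
url = "https://crm.example.com/api/v2/accounts/99"
body = {"data": {"client": {"id": "acct_99", "email": "g@example.com"}}}

matched = extract(sig, url, body)
# Same URL, but the portal renamed "client" to "customer": every field goes missing.
drifted = extract(sig, url, {"data": {"customer": {"id": "acct_99"}}})
unrelated = extract(sig, "https://crm.example.com/api/v2/invoices/7", body)
```

In this sketch a non-empty `missing` list is the condition that would correspond to an ExtractionFailed event, and a `None` result corresponds to "no signature matched".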

The three agents

                   ┌──────────────────────────────────────────────┐
                   │  Browser session → CDP forwarder → events    │
                   └──────────────────────┬───────────────────────┘
                                          │
   ┌──────────────────────┐               │              ┌───────────────────────┐
   │     Architect        │ ◀── once ──── │ ──── live ──▶│     Eavesdropper      │
   │  (LLM classifies     │               │              │  (Pydantic only,      │
   │   endpoints, builds  │               │              │   sub-ms hot path)    │
   │   SiteSchema +       │               │              │  emits                │
   │   signatures)        │               │              │  ExtractionResult or  │
   └──────────────────────┘               │              │  ExtractionFailed     │
              │                           │              └───────────┬───────────┘
              ▼                           │                          │
   ╔══════════════════════╗               │              ┌───────────▼───────────┐
   ║   MappedSite +       ║◀──── heals ───┼──────────────│       Healer          │
   ║   NetworkSignatures  ║               │              │  (LLM re-maps stale   │
   ╚══════════════════════╝               │              │   keys, auto-applies  │
                                          │              │   confident patches)  │
                                          │              └───────────────────────┘
  • Architect — runs once. Expensive. Produces the schema.
  • Eavesdropper — runs on every event. Free. Pure validation.
  • Healer — runs only on failures. Costs nothing when nothing breaks.

Install

pip install site-mapper-agents

For the runnable examples you'll also want a pydantic-ai provider:

pip install 'pydantic-ai[anthropic]'   # or [openai], [ollama], ...

Quickstart

import asyncio
from pydantic_ai.models.test import TestModel

from site_mapper_agents import (
    Architect,
    CDPNetworkEvent,
    Eavesdropper,
    TargetField,
    UserIntent,
)

# 1. Tell the system what you want to extract.
intent = UserIntent(
    description="Customer account details",
    target_fields=[
        TargetField(name="account_id", description="Account UUID"),
        TargetField(name="email", description="Primary contact email"),
    ],
)

# 2. Construct the Architect. Replace TestModel with a real provider.
architect = Architect(model=TestModel())  # or AnthropicModel("claude-sonnet-4-5")

# 3. Feed it a burst of CDP traffic (your forwarder produced these).
architect.record_traffic(CDPNetworkEvent(
    request_id="r1",
    url="https://crm.example.com/api/v2/accounts/42",
    method="GET",
    body={"data": {"client": {"id": "acct_42", "email": "ada@example.com"}}},
))

# 4. Ask the Architect to propose a schema.
async def onboard():
    proposal = await architect.propose(
        target_url="https://crm.example.com/accounts",
        user_intent=intent,
    )
    site = architect.build_mapped_site(
        proposal=proposal,
        target_url="https://crm.example.com/accounts",
        user_intent=intent,
    )
    return site

site = asyncio.run(onboard())

# 5. From now on, every live CDP event runs through the Eavesdropper.
eaves = Eavesdropper()
result, event = eaves.ingest(
    CDPNetworkEvent(
        request_id="r2",
        url="https://crm.example.com/api/v2/accounts/99",
        method="GET",
        body={"data": {"client": {"id": "acct_99", "email": "g@example.com"}}},
    ),
    sites=[site],
)
print(result.data_payload if result else "no match")

API reference

Architect(model=None, vocabulary=None, policy=DEFAULT_ONBOARDING_POLICY, model_settings=None)

The onboarding agent. LLM-once.

Parameter       Type                        Notes
model           pydantic_ai.Model | None    Any pydantic-ai model. None → heuristic.
vocabulary      list[EndpointType] | None   Caller-supplied classifications. See below.
policy          OnboardingPolicy            Sample-count thresholds.
model_settings  ModelSettings | None        max_tokens, temperature, etc.

Methods:

  • record_traffic(event) — buffer a CDP event during onboarding.
  • record_click() — mark that the user clicked something.
  • has_enough_samples() → bool — policy check.
  • detect_endpoints() → list[DetectedEndpoint] — deterministic pre-processing.
  • await propose(*, target_url, user_intent, llm_classify=None) → ArchitectProposal — the main entry point.
  • build_mapped_site(*, proposal, target_url, user_intent) → MappedSite — promote an approved proposal to an active site.
  • emit_event(site, *, success=True, reason="") → SiteMapped | OnboardingFailed.
  • reset() — clear buffers for the next onboarding session.

Eavesdropper(policy=DEFAULT_EXTRACTION_POLICY)

The runtime agent. No LLM. Pure Pydantic validation.

Methods:

  • ingest(event, sites) → (ExtractionResult | None, ExtractionSucceeded | ExtractionFailed | None).

Healer(model=None, policy=DEFAULT_HEALING_POLICY, model_settings=None)

The self-healing agent.

Methods:

  • await diagnose(*, site, failed_event, new_response_body=None, llm_semantic_match=None) → HealerPatch.
  • apply_patch(site, patch) → (bool, SchemaHealed | HealingFailed | SiteDegraded).

Models

Class              Purpose
CDPNetworkEvent    One captured network response. Library input.
TargetField        One data point the caller wants extracted.
UserIntent         A bundle of target fields with a human description.
EndpointType       One entry in the Architect's classification vocabulary.
DetectedEndpoint   Pre-LLM view of a unique endpoint.
NetworkSignature   URL pattern + JSON-key map. Saved per site.
SiteSchema         The extraction contract for one intent.
ArchitectProposal  Architect's structured output before user confirms.
HealerPatch        Healer's structured output for one repair attempt.
MappedSite         Aggregate root — schemas + signatures + status.
ExtractionResult   Eavesdropper's output for one matched event.

Domain events

SiteMapped, OnboardingFailed, ExtractionSucceeded, ExtractionFailed, SchemaHealed, HealingFailed, SiteDegraded.

All extend AutomationEvent (frozen Pydantic model).
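
As a rough illustration of what "frozen" buys an orchestrator, the sketch below uses stdlib frozen dataclasses as a stand-in for the library's frozen Pydantic models. The `site_id` and `missing_fields` fields and the `next_step` helper are hypothetical, and a real Pydantic model raises its own error type on assignment rather than `FrozenInstanceError`.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class AutomationEvent:
    site_id: str  # hypothetical field, not taken from the library


@dataclass(frozen=True)
class ExtractionSucceeded(AutomationEvent):
    pass


@dataclass(frozen=True)
class ExtractionFailed(AutomationEvent):
    missing_fields: tuple = ()  # hypothetical field


def next_step(event):
    """Decide what an orchestrator does with one event, by event type."""
    if isinstance(event, ExtractionFailed):
        return "send to healer"
    if isinstance(event, ExtractionSucceeded):
        return "store payload"
    return "ignore"


failed = ExtractionFailed(site_id="crm", missing_fields=("email",))
try:
    failed.site_id = "other"  # frozen instances reject attribute assignment
    was_mutated = True
except FrozenInstanceError:
    was_mutated = False
```

Because events cannot be changed after creation, they can be queued, logged, and replayed without defensive copies.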

Endpoint vocabularies

The Architect's LLM prompt embeds a list of EndpointType definitions that tell the model "you may only classify endpoints into one of these categories". The default vocabulary covers generic CRUD shapes:

Name            What it means
list_records    Paginated list of records (grid/table views).
detail_view     One record's full detail (after click-through).
search          Filtered records based on user query.
create_record   POST/PUT that creates a new record.
update_record   PATCH/PUT that mutates an existing record.
delete_record   DELETE.
reference_data  Lookup / enum / config data.
metrics         Dashboard counts/aggregates.
unknown         Fallback when nothing fits.

You'll usually want to extend this with site-specific categories:

from site_mapper_agents import (
    Architect,
    default_vocabulary,
    define_endpoint_type,
    merge_vocabularies,
)

vocab = merge_vocabularies(
    default_vocabulary(),
    [
        define_endpoint_type(
            name="invoice_pdf_download",
            description="Streaming download of a generated invoice PDF",
            expected_fields=["invoice_id", "pdf_url"],
        ),
        define_endpoint_type(
            name="webhook_subscription",
            description="Webhook registration endpoint that returns the subscription id",
            expected_fields=["subscription_id", "target_url", "events"],
        ),
    ],
)

architect = Architect(model=my_model, vocabulary=vocab)
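
To make the role of the vocabulary concrete, here is a stdlib-only sketch of how a closed category list could be rendered into a prompt and then enforced on the model's answer. None of this is the library's code: `EndpointKind`, `merge`, `render_prompt_section` and `coerce_label` are hypothetical, and the "later definition wins" rule for name collisions is an assumption, since the behaviour of `merge_vocabularies` on collisions is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointKind:
    """Hypothetical stand-in for one vocabulary entry."""
    name: str
    description: str
    expected_fields: tuple = ()


def merge(*vocabularies):
    """Merge by name; on a collision the later definition wins (an assumption)."""
    merged = {}
    for vocab in vocabularies:
        for kind in vocab:
            merged[kind.name] = kind
    return list(merged.values())


def render_prompt_section(vocab):
    """Render the closed list of categories the model may choose from."""
    lines = ["Classify each endpoint as exactly one of:"]
    lines += [f"- {kind.name}: {kind.description}" for kind in vocab]
    return "\n".join(lines)


def coerce_label(label, vocab):
    """Reject labels outside the vocabulary by falling back to 'unknown'."""
    allowed = {kind.name for kind in vocab}
    return label if label in allowed else "unknown"


defaults = [
    EndpointKind("list_records", "Paginated list of records"),
    EndpointKind("unknown", "Fallback when nothing fits"),
]
custom = [
    EndpointKind("invoice_pdf_download", "Streaming download of a generated invoice PDF",
                 ("invoice_id", "pdf_url")),
]
vocab = merge(defaults, custom)
prompt_section = render_prompt_section(vocab)
```

The point of the closed list is that a label the model invents, such as "invoice_export", never reaches the schema; it is coerced to "unknown" instead.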

LLM providers

The library binds to any provider pydantic-ai supports — just pass a Model instance (or its name) to the agent constructor:

# Anthropic
from pydantic_ai.models.anthropic import AnthropicModel
architect = Architect(model=AnthropicModel("claude-sonnet-4-5"))

# OpenAI
from pydantic_ai.models.openai import OpenAIModel
architect = Architect(model=OpenAIModel("gpt-4o"))

# Ollama (or any OpenAI-compatible local server)
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider
architect = Architect(model=OpenAIModel(
    "llama3.1:8b",
    provider=OpenAIProvider(base_url="http://localhost:11434/v1"),
))

# Deterministic stub for tests
from pydantic_ai.models.test import TestModel
architect = Architect(model=TestModel())

CDP burst format

CDPNetworkEvent is the only input shape the library cares about:

CDPNetworkEvent(
    request_id="<unique-id>",
    url="https://...",
    method="GET",
    status_code=200,
    headers={"content-type": "application/json"},
    body={"data": {"...": "..."}},   # parsed JSON
    frame_origin=None,                # set for iframe traffic
    target_id=None,                   # CDP target id, for multi-frame disambiguation
    timestamp=1715760000.0,
)

The library does not capture CDP traffic itself. Use a sibling tool — e.g. axumquant/cdp-network-interceptor — or your own Chrome extension / Puppeteer / Playwright session that emits this shape.
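
If you write your own forwarder, the fields above have to be assembled from several raw CDP payloads. The sketch below is one way to do that, assuming you already hold the params of `Network.requestWillBeSent` and `Network.responseReceived` and the result of `Network.getResponseBody` for the same request. `to_event_fields` is a hypothetical helper; using `wallTime` as the timestamp and lower-casing header names are assumptions, and resolving a CDP frame id into `frame_origin` is left out.

```python
import base64
import json


def to_event_fields(request_sent, response_received, response_body):
    """Combine three raw CDP payloads into CDPNetworkEvent-shaped keyword arguments.

    request_sent       params of Network.requestWillBeSent
    response_received  params of Network.responseReceived
    response_body      result of Network.getResponseBody
    """
    response = response_received["response"]
    raw = response_body["body"]
    try:
        if response_body.get("base64Encoded"):
            raw = base64.b64decode(raw).decode("utf-8")
        body = json.loads(raw)
    except ValueError:
        return None  # binary or non-JSON payload: nothing to extract from
    return {
        "request_id": response_received["requestId"],
        "url": response["url"],
        # The HTTP method is only present on the request-side event.
        "method": request_sent["request"]["method"],
        "status_code": response["status"],
        "headers": {k.lower(): v for k, v in response["headers"].items()},
        "body": body,
        # wallTime is seconds since the epoch; CDP's plain timestamp is monotonic.
        "timestamp": request_sent["wallTime"],
    }


request_sent = {"requestId": "r1", "wallTime": 1715760000.0,
                "request": {"url": "https://crm.example.com/api/v2/accounts/42",
                            "method": "GET"}}
response_received = {"requestId": "r1",
                     "response": {"url": "https://crm.example.com/api/v2/accounts/42",
                                  "status": 200,
                                  "headers": {"Content-Type": "application/json"}}}
response_body = {"body": '{"data": {"client": {"id": "acct_42"}}}',
                 "base64Encoded": False}

fields = to_event_fields(request_sent, response_received, response_body)
```

The resulting dict could then be passed as `CDPNetworkEvent(**fields)`, provided the model accepts exactly these keyword names as shown in the example above.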

Self-healing flow

When does the Healer fire?

  1. The Eavesdropper validates an incoming event and detects missing fields against a registered signature.
  2. It emits ExtractionFailed and returns it from ingest().
  3. Your orchestrator passes the failed event (plus the raw response body) to Healer.diagnose().
  4. The Healer runs structural matching first (same key still exists? then we just need a path tweak). If everything resolves structurally, no LLM call happens.
  5. Otherwise the Healer calls its pydantic-ai Agent with the old key map + new available keys + unresolved field names.
  6. The returned HealerPatch has an aggregate confidence:
    • ≥ auto_approve_above (default 0.90) → apply_patch() succeeds, emits SchemaHealed, signature is replaced in-place.
    • [min_semantic_confidence, require_human_review_below) (default 0.70–0.75) → apply_patch() returns HealingFailed with the reason "requires human review". Surface this to the user.
    • < min_semantic_confidence → site is marked DEGRADED, retried up to max_attempts times, then marked BROKEN.
  7. Persistence is the caller's job — the library mutates the MappedSite aggregate in memory but doesn't write it anywhere.
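
Steps 4 and 6 can be sketched without the library. The code below is a hypothetical illustration, not the Healer's algorithm: `structural_rematch` re-points a field only when its old leaf key appears exactly once in the new response, and `route` applies the default thresholds quoted above. The list does not say what happens between `require_human_review_below` and `auto_approve_above`, so the sketch returns "unspecified" for that band rather than guessing.

```python
def leaf_paths(body, prefix=""):
    """Yield every dotted path to a non-dict value in nested dicts."""
    for key, value in body.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from leaf_paths(value, path)
        else:
            yield path


def structural_rematch(old_key_map, new_body):
    """Re-point fields whose leaf key still exists exactly once in the new body."""
    paths = list(leaf_paths(new_body))
    resolved, unresolved = {}, []
    for field_name, old_path in old_key_map.items():
        leaf = old_path.rsplit(".", 1)[-1]
        candidates = [p for p in paths if p.rsplit(".", 1)[-1] == leaf]
        if len(candidates) == 1:
            resolved[field_name] = candidates[0]
        else:
            unresolved.append(field_name)  # left for the semantic (LLM) pass
    return resolved, unresolved


def route(confidence, auto_approve_above=0.90,
          require_human_review_below=0.75, min_semantic_confidence=0.70):
    """Map an aggregate patch confidence to the outcome described in step 6."""
    if confidence >= auto_approve_above:
        return "auto_apply"
    if confidence < min_semantic_confidence:
        return "degraded"
    if confidence < require_human_review_below:
        return "human_review"
    return "unspecified"  # band not described in the list above


old_key_map = {"account_id": "data.client.id", "email": "data.client.email"}
new_body = {"data": {"customer": {"id": "acct_99", "contact_email": "g@example.com"}}}
resolved, unresolved = structural_rematch(old_key_map, new_body)
```

Here `account_id` is repaired structurally because `id` merely moved under `customer`, while `email` stays unresolved because the key itself was renamed; only that second case would need a semantic match.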

Use cases

  • Salesforce custom-object extraction — Salesforce's API surface is huge and per-tenant. Onboard once against the tenant you have a session on, extract from then on.
  • HubSpot scraping — undocumented internal endpoints powering the UI.
  • Internal CRM discovery — your customer is on some no-name CRM you've never seen. Onboarding takes minutes.
  • Pre-acquisition portal audits — point it at a target's admin portal, get back a structured map of their data surface.
  • Partner integrations with companies who refuse to ship an API.

Pitfalls

  • The Architect costs money — it's an LLM call with a non-trivial prompt + context. Budget for one call per site you map. The Eavesdropper is free; the Healer only fires when something breaks.
  • Schema drift is real — sites change shapes monthly. Wire the Healer or you'll be debugging in production.
  • Auth-protected endpoints — the library never authenticates for you. You drive a real browser session; the CDP forwarder captures authenticated traffic. The library only sees the resulting bodies.
  • Rate limits — your scraping cadence is your problem. Polite pacing is on you.
  • Iframe traffic — the library handles frame_origin matching correctly, but your CDP forwarder MUST populate it. Without frame_origin, iframe responses match parent-frame signatures, which produces garbage extractions.
  • The vocabulary matters — generic CRUD works for most sites, but niche portals benefit a lot from a custom vocabulary that names the domain entities (e.g. invoice_line_items vs generic list_records).
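
The iframe pitfall is easier to see in code. The following sketch assumes, purely for illustration, that a signature stores the frame origin it was learned from and that matching requires exact equality; `FrameSignature` and `matches` are hypothetical names, and the library's actual matching rules may differ.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FrameSignature:
    """Hypothetical signature that remembers which frame it was learned from."""
    url_regex: str
    frame_origin: Optional[str] = None  # None means top-level frame in this sketch


def matches(sig, url, frame_origin):
    """Match only when both the frame origin and the URL pattern agree."""
    return frame_origin == sig.frame_origin and re.fullmatch(sig.url_regex, url) is not None


parent_sig = FrameSignature(url_regex=r".*/api/v2/accounts/\d+")
iframe_url = "https://billing.example.net/api/v2/accounts/7"

# Forwarder populated frame_origin: the iframe response is kept away from the parent signature.
with_origin = matches(parent_sig, iframe_url, "https://billing.example.net")
# Forwarder dropped frame_origin: the same response now looks like parent-frame traffic.
without_origin = matches(parent_sig, iframe_url, None)
```

The second call returning a match is the failure mode described in the iframe bullet: the URL shape is the same, so only the origin can tell the two responses apart.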

License

MIT — see LICENSE.
