axumquant/agentic-browser-lab

Agentic Browser Lab

Multi-agent browser automation that interviews you about the page, records your workflow, and replays it for any user — all from a Chrome extension. No headless browser, no Playwright server, no Selenium.

What makes this different

| | Browser-Use | Skyvern | Anthropic Computer Use | Agentic Browser Lab |
|---|---|---|---|---|
| Delivery | Python lib (headless) | Cloud SaaS (headless) | OS-level desktop control | Chrome MV3 extension |
| Agent design | Single LLM prompt | Single LLM prompt | Single Claude call | Multi-agent team (Perceiver + Planner + Interviewer) |
| Onboarding UX | Write prompts | Describe form | Just describe | Agent asks YOU multi-choice questions |
| Recovery | Re-prompt | Re-run | Re-shoot | Episodic memory + selector-failure learning |
| Uses user's real Chrome (cookies, auth) | ❌ | | (system Chrome) | ✅ Their actual logged-in session |

Architecture (Bounded Context per agentic-DDD)

┌─────────────────────────────────────────────────────────────────────┐
│  Browser side (Chrome MV3 extension)                                │
│  ────────────────────────────────────                                │
│  content-portals.js          background.js          executor.js     │
│  ┌──────────────────┐        ┌──────────────┐    ┌───────────────┐ │
│  │ Picker overlay    │        │ MV3 service  │    │ CDP-based     │ │
│  │ + HITL cards      │◀─────▶│ worker       │◀──▶│ click / type/ │ │
│  │ + Interview modal │        │ + screenshot │    │ navigate      │ │
│  │ + animated cursor │        │ + WS to API  │    │ (chrome.dbg)  │ │
│  └──────────────────┘        └──────────────┘    └───────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
                                  │
                                  │  POST /interview/start
                                  │  POST /interview/answer
                                  │  POST /events
                                  ▼
┌─────────────────────────────────────────────────────────────────────┐
│  Backend (Python / Pydantic AI)                                     │
│  ────────────────────────────────                                    │
│                                                                      │
│   PerceiverAgent    ──observe─▶  PageState                          │
│   (vision + DOM)                  { intent, fields, buttons,        │
│                                     blockers, progress, confidence } │
│                                                                      │
│   InterviewerAgent  ──propose─▶  Questionnaire                      │
│   (asks the user)                 { questions: [single/multi/free] } │
│                                                                      │
│   AnswerProcessor   ──resolve─▶  WorkflowProposal | ActionPlan |    │
│                                  FollowUpQuestionnaire               │
│                                                                      │
│   PlannerAgent      ──plan────▶  ExecutorPlan                       │
│   (freeform mode)                 { actions, confidence, goal_met } │
│                                                                      │
│   AutomationMemory  ──recall──▶  MemoryHit[]   ◀── mem0 (Qdrant)    │
│   (selector-failure memory)                                          │
└─────────────────────────────────────────────────────────────────────┘
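
The boxes above name the structured outputs that pass between agents. As a rough illustration only, those contracts could be sketched with standard-library dataclasses. The project itself uses Pydantic AI, so the field types, the `Question` class, and the `validate_question` helper below are assumptions made for this sketch, not the repository's actual models.

```python
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class PageState:
    """What the Perceiver reports about the current page (field types assumed)."""
    intent: str
    fields: list[str] = field(default_factory=list)
    buttons: list[str] = field(default_factory=list)
    blockers: list[str] = field(default_factory=list)
    progress: float = 0.0      # assumed to be a 0..1 fraction
    confidence: float = 0.0    # assumed to be a 0..1 score


@dataclass
class Question:
    """Hypothetical shape for one interview question."""
    prompt: str
    kind: Literal["single", "multi", "free"]
    options: list[str] = field(default_factory=list)


@dataclass
class Questionnaire:
    questions: list[Question]


@dataclass
class ExecutorPlan:
    actions: list[dict]
    confidence: float
    goal_met: bool


def validate_question(q: Question) -> None:
    """Choice questions need options; free-text questions must not list any."""
    if q.kind in ("single", "multi") and len(q.options) < 2:
        raise ValueError(f"{q.kind!r} question needs at least two options")
    if q.kind == "free" and q.options:
        raise ValueError("free-text question must not list options")
```

Keeping each agent's output as a small typed record is what lets the extension render a Questionnaire or execute an ExecutorPlan without parsing free-form LLM text.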

The 5 agents

| Agent | LLM call | Job |
|---|---|---|
| PerceiverAgent | qwen3-vl:235b → DOM-only fallback | Reads the screenshot + DOM, produces structured PageState (intent, fields, buttons, blockers) |
| PlannerAgent | deepseek-v4-pro | Takes instruction + PageState + action history + episodic memory → next 1-3 actions |
| InterviewerAgent | deepseek-v4-pro | Reads the page, asks the user multi-choice questions to disambiguate intent before acting |
| AnswerProcessor | deepseek-v4-pro | Takes user answers → produces a WorkflowProposal (save it), ActionPlan (run it), or FollowUpQuestionnaire |
| Mem0AutomationAdapter | (no LLM; vector store) | Remembers selector failures, workflow usage, and run summaries. Per user, persistent. |
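
The Perceiver row mentions a vision call with a DOM-only fallback. One way that fallback could work is sketched below. The function name `observe_with_fallback`, the callable signatures, the timeout value, and the `source` key are all hypothetical; the repository's real `observe_page` may behave differently.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical callables standing in for the vision-model and DOM-only perceivers.
VisionPerceiver = Callable[[dict, str], Awaitable[dict]]
DomPerceiver = Callable[[dict], Awaitable[dict]]


async def observe_with_fallback(
    dom_summary: dict,
    screenshot_data_url: str,
    vision: VisionPerceiver,
    dom_only: DomPerceiver,
    timeout_s: float = 20.0,  # assumed budget for the vision call
) -> dict:
    """Try the vision perceiver first; on any error or timeout, use DOM only."""
    try:
        state = await asyncio.wait_for(
            vision(dom_summary, screenshot_data_url), timeout_s
        )
        return {**state, "source": "vision"}
    except Exception:
        # Vision model unavailable, too slow, or returned something unusable.
        state = await dom_only(dom_summary)
        return {**state, "source": "dom_only"}
```

Recording which path produced the state lets downstream agents treat DOM-only results with lower confidence if they choose to.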

The Chrome extension UX

User clicks "Pick" on overlay
   │
   ▼
chrome.debugger.attach({tabId}, "1.3")    ← REQUIRED, not optional
   │
   ▼
Crosshair cursor + dashed outline on hover
   │
   ▼
Click → CDP fetches:
   • Accessibility.getPartialAXTree    (role, name, value, state)
   • CSS.getComputedStyleForNode       (display, visibility, opacity)
   • DOM.getBoxModel                   (precise quad)
   • DOM.getOuterHTML                  (1200-char excerpt)
   │
   ▼
Send to backend with picked element descriptor
   │
   ▼
Planner uses the picked element as authoritative target
   │
   ▼
Animated cursor flies to the target, pulses green on click
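
To make the "picked element descriptor" step more concrete, here is a minimal Python sketch of how a backend might fold the four CDP responses into one record. The response shapes follow the Chrome DevTools Protocol (`nodes`, `computedStyle`, `model.content`, `outerHTML`), but `build_descriptor`, its output keys, and the visibility heuristic are assumptions for illustration. In this project the extension does the CDP work in JavaScript.

```python
def quad_center(quad: list[float]) -> tuple[float, float]:
    """Center of a CDP quad: eight numbers, x1, y1, x2, y2, x3, y3, x4, y4."""
    xs, ys = quad[0::2], quad[1::2]
    return sum(xs) / 4, sum(ys) / 4


def build_descriptor(
    ax_tree: dict,         # result of Accessibility.getPartialAXTree
    computed_style: dict,  # result of CSS.getComputedStyleForNode
    box_model: dict,       # result of DOM.getBoxModel
    outer_html: dict,      # result of DOM.getOuterHTML
    max_html: int = 1200,  # excerpt length mentioned in the flow above
) -> dict:
    """Merge the four CDP responses into one descriptor (hypothetical shape)."""
    nodes = ax_tree.get("nodes") or [{}]
    node = nodes[0]
    style = {p["name"]: p["value"] for p in computed_style.get("computedStyle", [])}
    # Simple heuristic, assumed for this sketch: hidden if any of these match.
    visible = (
        style.get("display") != "none"
        and style.get("visibility") != "hidden"
        and style.get("opacity") != "0"
    )
    return {
        "role": node.get("role", {}).get("value"),
        "name": node.get("name", {}).get("value"),
        "visible": visible,
        "center": quad_center(box_model["model"]["content"]),
        "html_excerpt": outer_html["outerHTML"][:max_html],
    }
```

The center point is what a click action or the animated cursor would target; the role and name give the planner a human-readable handle on the element.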

When the agent is uncertain, a HITL card appears with clipped screenshots of each candidate element so the user can pick visually instead of reading selectors.
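
The clipped screenshots on a HITL card need a capture rectangle per candidate. The sketch below computes a padded rectangle clamped to the viewport, in the `{x, y, width, height, scale}` shape that CDP's `Page.captureScreenshot` accepts for its `clip` parameter. The function name, the padding default, and the assumption that the box and viewport share one coordinate space are illustrative, not taken from the repository.

```python
def clip_for_candidate(
    x: float,
    y: float,
    width: float,
    height: float,
    viewport_w: float,
    viewport_h: float,
    pad: float = 8.0,  # assumed margin so the element is not cropped tightly
) -> dict:
    """Padded clip rectangle for one candidate, clamped to the viewport."""
    left = max(0.0, x - pad)
    top = max(0.0, y - pad)
    right = min(float(viewport_w), x + width + pad)
    bottom = min(float(viewport_h), y + height + pad)
    if right <= left or bottom <= top:
        raise ValueError("candidate lies outside the viewport")
    return {
        "x": left,
        "y": top,
        "width": right - left,
        "height": bottom - top,
        "scale": 1,
    }
```

A small margin around each candidate usually gives the user enough surrounding context to tell two similar buttons apart.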

What you'd extract from this for your own project

Drop-in agents (Python / Pydantic AI)

from agentic_browser_lab.automation.perception_agent import observe_page
from agentic_browser_lab.automation.pai_wiring import plan_actions_from_instruction
from agentic_browser_lab.automation.interviewer_agent import interview_page

# Note: these calls use await, so run them inside an async function.

# Vision + DOM → structured page state
state = await observe_page(
    dom_summary=my_dom_dict,
    screenshot_data_url="data:image/png;base64,...",
    page_url="https://example.com",
    goal="fill the lookup form",
)
# Returns PageState(intent="lookup form", fields=[...], buttons=[...], blockers=[...])

# Instruction + page state → concrete actions
plan = await plan_actions_from_instruction(
    instruction="Enter ZIP 90210 and click Find",
    dom_summary={
        **my_dom_dict,  # the same DOM summary that was passed to observe_page
        "perceived_intent": state.intent,
        # ... (further perceived keys elided)
    },
    picked_elements=[],
)
# Returns ExecutorPlan(actions=[ExecutorAction(...), ...], confidence=0.95, goal_met=False)
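
Since a plan carries only the next few actions plus a `goal_met` flag, a caller would typically loop: observe, plan, act, repeat. The sketch below shows one possible driver for that loop. `run_until_goal`, the callable signatures, the dict-shaped plan, and the step cap are hypothetical stand-ins, not this package's API.

```python
import asyncio
from typing import Any, Awaitable, Callable

Action = dict[str, Any]


async def run_until_goal(
    observe: Callable[[], Awaitable[dict]],
    plan: Callable[[str, dict, list], Awaitable[dict]],
    act: Callable[[Action], Awaitable[None]],
    instruction: str,
    max_steps: int = 10,  # assumed safety cap on planning rounds
) -> list[Action]:
    """Observe, plan a few actions, execute them, and repeat until goal_met."""
    history: list[Action] = []
    for _ in range(max_steps):
        state = await observe()
        step = await plan(instruction, state, history)
        for action in step["actions"]:
            await act(action)
            history.append(action)
        if step["goal_met"]:
            return history
    raise RuntimeError(f"goal not met after {max_steps} planning steps")
```

Passing the action history back into each planning call mirrors the PlannerAgent inputs listed earlier, and the step cap keeps a confused planner from looping forever.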

Drop-in Chrome extension primitives

  • content-portals.js — picker overlay, HITL cards, interview modal, animated cursor
  • executor.js — CDP click/type/key/navigate/wait_network_idle
  • extension-logger.js — structured logging to chrome.storage
  • runtime-config-shared.js — runtime config bootstrap

Episodic memory (the part nobody else has)

from agentic_browser_lab.memory import get_automation_memory

mem = get_automation_memory()

# Write
await mem.remember_selector_failure(
    org_id=org, user_id=user,
    page_url="https://sunfire.example/lookup",
    stale_selector="[data-testid=search]",
    replacement="button.lookup-primary",
    replacement_kind="selector",
)

# Read (every planner call does this)
hits = await mem.recall_for_planner(
    org_id=org, user_id=user,
    query="fill ZIP and click lookup",
    page_url="https://sunfire.example/lookup",
    page_intent="customer lookup form",
)
# Returns MemoryHit objects — the planner is instructed to AVOID
# selectors that have failed before for this user.
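
In this project the planner is told about past failures through its prompt. The same idea can also be applied deterministically, as in the sketch below, which swaps known-stale selectors for their recorded replacements before acting. `apply_selector_memory` is hypothetical, and the hit records are modelled as plain dicts whose keys mirror the `remember_selector_failure` arguments above; the real MemoryHit type may differ.

```python
def apply_selector_memory(candidates: list[str], hits: list[dict]) -> list[str]:
    """Swap selectors that failed before for their recorded replacement.

    A stale selector with no replacement is dropped. Order is preserved and
    duplicates are removed.
    """
    replacements = {h["stale_selector"]: h.get("replacement") for h in hits}
    result: list[str] = []
    for selector in candidates:
        if selector in replacements:
            selector = replacements[selector]
        if selector and selector not in result:
            result.append(selector)
    return result
```

For example, with the failure recorded above, the candidate `[data-testid=search]` would be rewritten to `button.lookup-primary` before any click is attempted.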

Backed by mem0 + Qdrant. Falls back to in-memory storage when they are not configured. Memory calls never raise.
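
A "never raises" memory layer with an in-memory fallback could look roughly like the sketch below. `InMemoryStore`, `SafeMemory`, and their `add` / `search` methods are invented for this illustration and are not the mem0 API or this package's adapter.

```python
import logging

log = logging.getLogger("automation_memory")


class InMemoryStore:
    """Process-local fallback used when no vector store is configured."""

    def __init__(self) -> None:
        self._items: list[dict] = []

    def add(self, record: dict) -> None:
        self._items.append(record)

    def search(self, page_url: str) -> list[dict]:
        return [r for r in self._items if r.get("page_url") == page_url]


class SafeMemory:
    """Wraps a store so that memory failures never propagate to the caller."""

    def __init__(self, primary=None) -> None:
        # No configured store: fall back to the in-memory one.
        self._store = primary if primary is not None else InMemoryStore()

    def remember(self, record: dict) -> bool:
        try:
            self._store.add(record)
            return True
        except Exception:
            log.warning("memory write failed; continuing without it", exc_info=True)
            return False

    def recall(self, page_url: str) -> list[dict]:
        try:
            return list(self._store.search(page_url))
        except Exception:
            log.warning("memory read failed; returning no hits", exc_info=True)
            return []
```

Treating memory as best-effort means a vector-store outage degrades planning quality instead of failing the run.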

Stack

| Layer | Tech |
|---|---|
| Agents | Pydantic AI (structured output) + Ollama Cloud (deepseek-v4-pro for reasoning, qwen3-vl for vision) |
| Memory | mem0 + Qdrant (per-user episodic) |
| Extension | Chrome MV3 + CDP (chrome.debugger) |
| Backend | FastAPI + httpx |

Designed to plug into your existing LLM gateway / vector store. Provider-neutral.
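
One way to picture plugging in your own LLM gateway is a small structural interface that any provider can satisfy. The sketch below uses `typing.Protocol`; `ChatProvider`, `EchoProvider`, and `plan_text` are hypothetical names made up for this example and do not exist in the package.

```python
import asyncio
from typing import Protocol


class ChatProvider(Protocol):
    """Anything with this async method can serve as the LLM backend."""

    async def complete(self, model: str, prompt: str) -> str: ...


class EchoProvider:
    """Toy provider for tests; a real one would call your LLM gateway."""

    async def complete(self, model: str, prompt: str) -> str:
        return f"[{model}] {prompt}"


async def plan_text(
    provider: ChatProvider,
    instruction: str,
    model: str = "deepseek-v4-pro",  # model name taken from the stack table
) -> str:
    """Ask whichever provider was injected for a plan, as raw text."""
    return await provider.complete(model, f"Plan the next actions for: {instruction}")
```

Because the protocol is structural, an existing gateway client only needs a matching `complete` method; it does not have to inherit from anything.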

Status

Extracted from a production B2B SaaS in May 2026. Backend is running in prod with 3 migrations applied (027_learned_workflows, 028_learned_workflow_versions, 029_learned_workflow_marketplace — the marketplace + versioning live in the learned-workflows-marketplace sibling repo).

Sibling repos

This repo focuses on the live observe / propose / act loop. The adjacent concerns are split into focused sibling repos so each one stays small enough to read in one sitting:

| Repo | What it owns |
|---|---|
| learned-workflows-marketplace | Storage triad (Postgres + Qdrant + Neo4j), versioning, cross-org marketplace, parameterization: the save / share / replay half |
| site-mapper-agents | Architect + Healer + Eavesdropper agents covering endpoint classification, schema repair, and replay-driven extraction: the API-discovery half |
| cdp-network-interceptor | The CDP wrapper used by site-mapper-agents to capture XHR/fetch traffic from a real Chrome tab |
| mv3-audio-replay-buffer | The MV3 offscreen-doc audio capture primitive used elsewhere in the parent platform |

If you need the full self-healing portal extraction stack (not just the live browser planner this repo provides), pip install site-mapper-agents alongside this package.

Roadmap

  • Drop the FastAPI lock-in — package as pip install agentic-browser-lab with optional FastAPI plugin
  • Pluggable LLM provider (currently coupled to Ollama Cloud, easy to abstract)
  • Headless mode for CI/CD (Playwright adapter)
  • Cross-browser (Firefox MV3 port)

License

MIT — see LICENSE

Maintainer

@axumquant — built as part of Sales Coach (Medicare insurance B2B), open-sourced for the agent-tooling community.
