diff --git a/README.md b/README.md index ab7f6a5d..8d4c0bfe 100644 --- a/README.md +++ b/README.md @@ -37,6 +37,10 @@ Click Allow when the per-attach popup appears (Chrome 144+): Allow remote debugging popup See [agent-workspace/domain-skills/](agent-workspace/domain-skills/) for example tasks. +For the core pattern behind hard browser tasks, read +[Self-modifying browser harness](docs/self-modifying-harness.md). It includes +concrete examples for signature canvases, file uploads, drag/drop, and +coordinate-only controls, plus a local edge-case benchmark page. ## Free Browser Use Cloud browsers diff --git a/SKILL.md b/SKILL.md index 531d0ab9..c81f652d 100644 --- a/SKILL.md +++ b/SKILL.md @@ -81,6 +81,7 @@ If you start struggling with a specific mechanic while navigating, look in inter ## What actually works - Screenshots first: use capture_screenshot() to understand the current page quickly, find visible targets, and decide whether you need a click, a selector, or more navigation. +- If an interaction helper is missing, treat that as editable harness work, not a task failure. Reproduce, inspect, add the smallest helper in `agent-workspace/agent_helpers.py`, retry, and keep the reusable pattern. See `docs/self-modifying-harness.md` and `docs/edge-case-benchmark.html` for upload, drag/drop, signature canvas, and coordinate-only examples. - Clicking: capture_screenshot() → read the pixel off the image → click_at_xy(x, y) → capture_screenshot() to verify. Suppress the Playwright-habit reflex of "locate first, then click" — no getBoundingClientRect, no selector hunt. Drop to DOM only when the target has no visible geometry (hidden input, 0×0 node). Hit-testing happens in Chrome's browser process, so clicks go through iframes / shadow DOM / cross-origin without extra work. - Bulk HTTP: http_get(url) + ThreadPoolExecutor. No browser for static pages (249 Netflix pages in 2.8s). - After goto: wait_for_load(). diff --git a/docs/edge-case-benchmark.html b/docs/edge-case-benchmark.html new file mode 100644 index 00000000..8e37af84 --- /dev/null +++ b/docs/edge-case-benchmark.html @@ -0,0 +1,294 @@ + + + + + + Browser Harness edge-case benchmark + + + +
+

Browser Harness edge-case benchmark

+

+ Four small browser tasks that usually force agents to switch tactics: + hidden file input, drag/drop event payloads, canvas signature input, and a + coordinate-only canvas target. +

+ +
+
+

1. File upload

+ + +
waiting for file
+
+ +
+

2. Drag and drop

+
drag token
+
drop token here
+
waiting for drop
+
+ +
+

3. Canvas signature

+ +
waiting for stroke
+
+ +
+

4. Coordinate target

+ +
waiting for target click
+
+
+ +
window.bhBenchmarkResults() -> pending
+
+ + + + diff --git a/docs/self-modifying-harness.md b/docs/self-modifying-harness.md new file mode 100644 index 00000000..92cfe848 --- /dev/null +++ b/docs/self-modifying-harness.md @@ -0,0 +1,156 @@ +# Self-modifying harness + +Browser Harness works because the agent is not trapped behind a fixed browser API. +When a page has a new edge case, the agent can inspect the page, add the missing +helper in `agent-workspace/agent_helpers.py` or a domain skill, retry the task, +and keep the working code for the next run. + +The loop is: + +1. Use `capture_screenshot()` and `page_info()` to understand the current state. +2. Try the smallest existing primitive: `click_at_xy`, `type_text`, `press_key`, + `upload_file`, `js`, or raw `cdp`. +3. If the primitive is missing, add a helper in `agent-workspace/agent_helpers.py` + or a site-specific note in `agent-workspace/domain-skills//`. +4. Run the helper, then verify with a screenshot or page state. +5. Commit only durable knowledge: selectors, DOM events, CDP calls, endpoint + shape, or framework quirks. Do not commit secrets or one-off pixel positions. + +That is why browser edge cases are not permanent. They are usually just missing +local code. + +## Example 1: file upload + +Start with the built-in helper when there is a real file input: + +```python +upload_file("input[type=file]", "/tmp/invoice.pdf") +js(""" +const input = document.querySelector('input[type=file]'); +input.dispatchEvent(new Event('input', {bubbles: true})); +input.dispatchEvent(new Event('change', {bubbles: true})); +""") +``` + +If the app hides the input behind a custom button, add a helper that finds the +real input once and then reuses the same path: + +```python +# agent-workspace/agent_helpers.py +def upload_to_hidden_input(label_text, path): + selector = js(f""" + const labels = [...document.querySelectorAll('label,button,[role=button]')]; + const trigger = labels.find(e => e.innerText.trim().includes({label_text!r})); + const root = trigger?.closest('form,section,div') || document; + const input = root.querySelector('input[type=file]'); + if (!input) throw new Error('no file input near upload trigger'); + input.setAttribute('data-bh-upload-target', '1'); + return '[data-bh-upload-target="1"]'; + """) + upload_file(selector, path) +``` + +Keep this in a domain skill if the selector is specific to one product. + +## Example 2: drag and drop + +Try compositor-level input first when the page responds to real pointer motion: + +```python +cdp("Input.dispatchMouseEvent", type="mouseMoved", x=120, y=220) +cdp("Input.dispatchMouseEvent", type="mousePressed", x=120, y=220, button="left") +cdp("Input.dispatchMouseEvent", type="mouseMoved", x=500, y=260, button="left") +cdp("Input.dispatchMouseEvent", type="mouseReleased", x=500, y=260, button="left") +``` + +Some React or editor surfaces require DOM drag events with a `DataTransfer` +payload. Put that site-specific sequence in a helper: + +```python +# agent-workspace/agent_helpers.py +def dom_drag_text(source_selector, target_selector, payload): + js(f""" + const source = document.querySelector({source_selector!r}); + const target = document.querySelector({target_selector!r}); + const dt = new DataTransfer(); + dt.setData('text/plain', {payload!r}); + for (const type of ['dragstart', 'dragenter', 'dragover', 'drop', 'dragend']) {{ + const node = type === 'dragstart' || type === 'dragend' ? source : target; + node.dispatchEvent(new DragEvent(type, {{bubbles: true, cancelable: true, dataTransfer: dt}})); + }} + """) +``` + +The durable knowledge is not the one drag coordinate. It is whether that app +wants real pointer events, DOM drag events, or a hidden file input. + +## Example 3: canvas signature + +Canvas fields often have no useful DOM target inside the drawing area. Use the +canvas bounds plus raw mouse events: + +```python +box = js(""" +const r = document.querySelector('canvas.signature').getBoundingClientRect(); +return {x: r.left, y: r.top, w: r.width, h: r.height}; +""") + +points = [ + (box["x"] + 20, box["y"] + box["h"] * 0.55), + (box["x"] + 80, box["y"] + box["h"] * 0.35), + (box["x"] + 150, box["y"] + box["h"] * 0.60), + (box["x"] + 220, box["y"] + box["h"] * 0.40), +] +cdp("Input.dispatchMouseEvent", type="mousePressed", x=points[0][0], y=points[0][1], button="left") +for x, y in points[1:]: + cdp("Input.dispatchMouseEvent", type="mouseMoved", x=x, y=y, button="left") +cdp("Input.dispatchMouseEvent", type="mouseReleased", x=points[-1][0], y=points[-1][1], button="left") +``` + +Verify by reading app state or screenshotting the canvas. If the page stores a +hidden signature value, add a site helper that checks that field too. + +## Example 4: coordinate-only target + +Some UI is only visible pixels: maps, whiteboards, games, image editors, or +custom canvas widgets. Do not hunt for selectors that do not exist. Let the +screenshot drive the next action: + +```python +shot = capture_screenshot("/tmp/target.png", max_dim=1800) +info = page_info() +print(info) +# read the target position from the screenshot, convert image pixels to CSS +# pixels if devicePixelRatio > 1, then click: +click_at_xy(412, 318) +capture_screenshot("/tmp/after.png", max_dim=1800) +``` + +If the target is stable enough to compute, write the helper: + +```python +# agent-workspace/agent_helpers.py +def click_canvas_target(canvas_selector, rel_x, rel_y): + box = js(f""" + const r = document.querySelector({canvas_selector!r}).getBoundingClientRect(); + return {{x: r.left, y: r.top, w: r.width, h: r.height}}; + """) + click_at_xy(box["x"] + box["w"] * rel_x, box["y"] + box["h"] * rel_y) +``` + +The point is not that coordinates are magic. The point is that the agent can +choose coordinates, DOM, CDP, or a new helper after it sees the actual page. + +## Benchmark page + +`docs/edge-case-benchmark.html` contains a small local page with four tasks: + +- file upload +- drag and drop +- canvas signature +- coordinate-only canvas click + +Use it when changing helper behavior or when demonstrating the self-modifying +loop. A run is successful when `window.bhBenchmarkResults()` returns all four +checks as `true`. +