Commit 8898a01

fix(cua): use scroll notch count (wheel units) in all computer-use templates (#129)
Updates scrolling logic across all computer-use templates in the CLI.

- **Branch:** `fix-cua-templates-scroll-behavior`

Made with [Cursor](https://cursor.com)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes scroll semantics across multiple computer-use templates from pixel-based deltas to wheel-notch units, which can materially alter agent navigation behavior and task success. Also includes a few runtime guards (local execution gating, null-handling) that are low risk but touch invocation paths.
>
> **Overview**
> **Unifies scroll behavior across computer-use templates** by switching from pixel-based scroll deltas to *wheel unit (notch) counts* in Anthropic (TS/Python), Gemini (TS/Python), Yutori (TS/Python), and OpenAGI handlers, and updating tool outputs/prompts to reflect the new units.
>
> **Gemini templates now convert `magnitude` (px) to capped notch counts** (via `PX_PER_NOTCH`/`MAX_NOTCHES_PER_ACTION`) and add a few safety checks (e.g., missing function names/content). Local test entrypoints in Gemini are gated on `KERNEL_INVOCATION`, and the TS Gemini session adds null-safe handling for returned URLs/IDs.
>
> **Template/QA naming is aligned** by changing the yutori template key from `yutori-computer-use` to `yutori` in docs (`qa.md`) and `pkg/create/templates.go`.
>
> <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 6f3501d. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
1 parent 43fc638 commit 8898a01

30 files changed: +125 -140 lines changed

.cursor/commands/qa.md

Lines changed: 4 additions & 4 deletions
@@ -58,7 +58,7 @@ Here are all valid language + template combinations:
 | typescript | openai-computer-use | ts-openai-cua | ts-openai-cua | Yes | OPENAI_API_KEY |
 | typescript | gemini-computer-use | ts-gemini-cua | ts-gemini-cua | Yes | GOOGLE_API_KEY |
 | typescript | claude-agent-sdk | ts-claude-agent-sdk | ts-claude-agent-sdk | Yes | ANTHROPIC_API_KEY |
-| typescript | yutori-computer-use | ts-yutori-cua | ts-yutori-cua | Yes | YUTORI_API_KEY |
+| typescript | yutori | ts-yutori-cua | ts-yutori-cua | Yes | YUTORI_API_KEY |
 
 | python | sample-app | py-sample-app | python-basic | No | - |
 | python | gemini-computer-use | py-gemini-cua | python-gemini-cua | Yes | GOOGLE_API_KEY |
@@ -68,7 +68,7 @@ Here are all valid language + template combinations:
 | python | openai-computer-use | py-openai-cua | python-openai-cua | Yes | OPENAI_API_KEY |
 | python | openagi-computer-use | py-openagi-cua | python-openagi-cua | Yes | OAGI_API_KEY |
 | python | claude-agent-sdk | py-claude-agent-sdk | py-claude-agent-sdk | Yes | ANTHROPIC_API_KEY |
-| python | yutori-computer-use | py-yutori-cua | python-yutori-cua | Yes | YUTORI_API_KEY |
+| python | yutori | py-yutori-cua | python-yutori-cua | Yes | YUTORI_API_KEY |
 
 > **Yutori:** Test both default browser and `"kiosk": true` (uses Playwright for goto_url when kiosk is enabled).
 
@@ -86,7 +86,7 @@ Run each of these (they are non-interactive when all flags are provided):
 ../bin/kernel create -n ts-openai-cua -l typescript -t openai-computer-use
 ../bin/kernel create -n ts-gemini-cua -l typescript -t gemini-computer-use
 ../bin/kernel create -n ts-claude-agent-sdk -l typescript -t claude-agent-sdk
-../bin/kernel create -n ts-yutori-cua -l typescript -t yutori-computer-use
+../bin/kernel create -n ts-yutori-cua -l typescript -t yutori
 
 # Python templates
 ../bin/kernel create -n py-sample-app -l python -t sample-app
@@ -97,7 +97,7 @@ Run each of these (they are non-interactive when all flags are provided):
 ../bin/kernel create -n py-openagi-cua -l python -t openagi-computer-use
 ../bin/kernel create -n py-claude-agent-sdk -l python -t claude-agent-sdk
 ../bin/kernel create -n py-gemini-cua -l python -t gemini-computer-use
-../bin/kernel create -n py-yutori-cua -l python -t yutori-computer-use
+../bin/kernel create -n py-yutori-cua -l python -t yutori
 ```
 
 ## Step 5: Deploy Each Template

pkg/create/templates.go

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ const (
 	TemplateStagehand = "stagehand"
 	TemplateOpenAGIComputerUse = "openagi-computer-use"
 	TemplateClaudeAgentSDK = "claude-agent-sdk"
-	TemplateYutoriComputerUse = "yutori-computer-use"
+	TemplateYutoriComputerUse = "yutori"
 )
 
 type TemplateInfo struct {

pkg/templates/python/anthropic-computer-use/loop.py

Lines changed: 1 addition & 0 deletions
@@ -50,6 +50,7 @@ class APIProvider(StrEnum):
 * As the initial step click on the search bar.
 * When viewing a page it can be helpful to zoom out so that you can see everything on the page.
 * Either that, or make sure you scroll down to see everything before deciding something isn't available.
+* Scroll action: scroll_amount and the tool result are in wheel units (not pixels).
 * When using your computer function calls, they take a while to run and send back to you.
 * Where possible/feasible, try to chain multiple of these calls all into one function calls request.
 * The current date is {datetime.now().strftime("%A, %B %d, %Y")}.

pkg/templates/python/anthropic-computer-use/tools/computer.py

Lines changed: 11 additions & 10 deletions
@@ -370,21 +370,17 @@ async def __call__(
             else:
                 x, y = self._last_mouse_position
 
-            # Each scroll_amount unit = 1 scroll wheel click ≈ 120 pixels (matches Anthropic's xdotool behavior)
-            scroll_factor = scroll_amount * 120
-
+            notches = max(scroll_amount or 1, 1)
             delta_x = 0
             delta_y = 0
             if scroll_direction == "up":
-                delta_y = -scroll_factor
+                delta_y = -notches
             elif scroll_direction == "down":
-                delta_y = scroll_factor
+                delta_y = notches
             elif scroll_direction == "left":
-                delta_x = -scroll_factor
+                delta_x = -notches
             elif scroll_direction == "right":
-                delta_x = scroll_factor
-
-            print(f"Scrolling {abs(delta_x) if delta_x != 0 else abs(delta_y)} pixels {scroll_direction}")
+                delta_x = notches
 
             self.kernel.browsers.computer.scroll(
                 id=self.session_id,
@@ -393,7 +389,12 @@ async def __call__(
                 delta_x=delta_x,
                 delta_y=delta_y,
             )
-            return await self.screenshot()
+
+            await asyncio.sleep(0.2)
+            screenshot_result = await self.screenshot()
+            return screenshot_result.replace(
+                output=f"Scrolled {notches} wheel unit(s) {scroll_direction}."
+            )
 
         if action in ("hold_key", "wait"):
             if duration is None or not isinstance(duration, (int, float)):
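
The scroll hunk above turns a direction plus `scroll_amount` into signed wheel-notch deltas. A minimal standalone sketch of that mapping follows. The helper name `scroll_deltas` is hypothetical (it is not part of the template), and no Kernel client is involved:

```python
from typing import Optional, Tuple


def scroll_deltas(direction: str, scroll_amount: Optional[int]) -> Tuple[int, int]:
    """Return (delta_x, delta_y) in wheel notches for a scroll request."""
    # As in the diff: a missing, zero, or negative amount is clamped to one notch.
    notches = max(scroll_amount or 1, 1)
    # Unit vector per direction; up and left are negative.
    unit = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
    dx, dy = unit.get(direction, (0, 0))  # unknown direction leaves both deltas at 0
    return dx * notches, dy * notches


if __name__ == "__main__":
    print(scroll_deltas("down", 3))
    print(scroll_deltas("up", None))
```

Because of the clamp, `scroll_amount=0` scrolls by one notch rather than not at all. That may or may not be the intended behavior, so it is worth checking when reviewing the change.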

pkg/templates/python/gemini-computer-use/main.py

Lines changed: 2 additions & 3 deletions
@@ -75,9 +75,8 @@ async def cua_task(
     }
 
 
-# Run locally if executed directly (not imported as a module)
-# Execute via: uv run main.py
-if __name__ == "__main__":
+# Run locally when not in Kernel invocation. Execute via: uv run main.py
+if __name__ == "__main__" and not os.getenv("KERNEL_INVOCATION"):
     import asyncio
 
     async def main():
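
The new guard runs the local entrypoint only when the module is executed directly and `KERNEL_INVOCATION` is unset or empty. The hunk does not show an `import os`, so this assumes `os` is already imported elsewhere in `main.py`. A small sketch of the same condition as a testable function, where `should_run_locally` is a hypothetical name used only for illustration:

```python
import os
from typing import Mapping


def should_run_locally(module_name: str, env: Mapping[str, str] = os.environ) -> bool:
    """Mirror of: __name__ == "__main__" and not os.getenv("KERNEL_INVOCATION")."""
    # An unset variable and an empty string are both falsy, so both allow a local run.
    return module_name == "__main__" and not env.get("KERNEL_INVOCATION")


if __name__ == "__main__":
    print(should_run_locally("__main__", {}))
    print(should_run_locally("__main__", {"KERNEL_INVOCATION": "1"}))
```

Any non-empty value of the variable, including `"0"` or `"false"`, disables the local run, since the check is on truthiness of the string and not on its meaning.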

pkg/templates/python/gemini-computer-use/tools/computer.py

Lines changed: 16 additions & 21 deletions
@@ -21,6 +21,8 @@
 
 TYPING_DELAY_MS = 12
 SCREENSHOT_DELAY_SECS = 0.5
+PX_PER_NOTCH = 60
+MAX_NOTCHES_PER_ACTION = 17
 
 
 class ComputerTool:
@@ -131,22 +133,21 @@ async def execute_action(
             elif action_name == GeminiAction.SCROLL_DOCUMENT:
                 if "direction" not in args:
                     return ToolResult(error="scroll_document requires direction")
-                # Scroll at center of viewport
                 center_x = self.screen_size.width // 2
                 center_y = self.screen_size.height // 2
-                scroll_delta = 500
 
-                delta_x, delta_y = 0, 0
+                magnitude_px = args.get("magnitude", 400)
+                doc_notches = min(MAX_NOTCHES_PER_ACTION, max(1, round(magnitude_px / PX_PER_NOTCH)))
                 direction = args["direction"]
+                delta_x = delta_y = 0
                 if direction == "down":
-                    delta_y = scroll_delta
+                    delta_y = doc_notches
                 elif direction == "up":
-                    delta_y = -scroll_delta
+                    delta_y = -doc_notches
                 elif direction == "right":
-                    delta_x = scroll_delta
+                    delta_x = doc_notches
                 elif direction == "left":
-                    delta_x = -scroll_delta
-
+                    delta_x = -doc_notches
                 self.kernel.browsers.computer.scroll(
                     self.session_id,
                     x=center_x,
@@ -164,24 +165,18 @@ async def execute_action(
                 x = self.denormalize_x(args["x"])
                 y = self.denormalize_y(args["y"])
 
-                # Denormalize magnitude if provided
-                magnitude = args.get("magnitude", 800)
+                magnitude_px = args.get("magnitude", 400)
+                notches = min(MAX_NOTCHES_PER_ACTION, max(1, round(magnitude_px / PX_PER_NOTCH)))
                 direction = args["direction"]
-                if direction in ("up", "down"):
-                    magnitude = self.denormalize_y(magnitude)
-                else:
-                    magnitude = self.denormalize_x(magnitude)
-
-                delta_x, delta_y = 0, 0
+                delta_x = delta_y = 0
                 if direction == "down":
-                    delta_y = magnitude
+                    delta_y = notches
                 elif direction == "up":
-                    delta_y = -magnitude
+                    delta_y = -notches
                 elif direction == "right":
-                    delta_x = magnitude
+                    delta_x = notches
                 elif direction == "left":
-                    delta_x = -magnitude
-
+                    delta_x = -notches
                 self.kernel.browsers.computer.scroll(
                     self.session_id,
                     x=x,
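
Both Gemini scroll branches above use the same expression to turn a pixel `magnitude` into a capped notch count. A minimal sketch of that conversion in isolation follows. The constants are copied from the diff, while the function name `px_to_notches` is hypothetical:

```python
PX_PER_NOTCH = 60            # from the diff: pixels treated as one wheel notch
MAX_NOTCHES_PER_ACTION = 17  # from the diff: upper bound per scroll action


def px_to_notches(magnitude_px: float = 400) -> int:
    """Convert a pixel magnitude to a notch count clamped to [1, MAX_NOTCHES_PER_ACTION]."""
    return min(MAX_NOTCHES_PER_ACTION, max(1, round(magnitude_px / PX_PER_NOTCH)))


if __name__ == "__main__":
    for px in (0, 60, 150, 400, 5000):
        print(px, "->", px_to_notches(px))
```

With these constants the default magnitude of 400 maps to 7 notches, and anything at or above 1020 px hits the 17-notch cap. Python's `round` rounds exact halves to the nearest even integer, so 150 px (2.5 notches) becomes 2 and not 3.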

pkg/templates/python/openagi-computer-use/kernel_handler.py

Lines changed: 21 additions & 25 deletions
@@ -36,13 +36,16 @@ class KernelActionHandler:
     - HOTKEY -> press_key(keys=[...])
     - TYPE -> type_text(text=...)
     - SCROLL -> scroll(x, y, delta_y=...)
+
+    Note: OpenAGI/Lux tends to emit scroll N times for "scroll by N" (e.g. 3 identical
+    [scroll] actions for "scroll down with amount 3"). We treat each scroll event as
+    one scroll unit (1 notch), so N events in a row = N notches without fighting the model.
     """
 
     def __init__(
         self,
         session: "KernelBrowserSession",
         action_pause: float = 0.1,
-        scroll_amount: int = 100,
         wait_duration: float = 1.0,
         type_delay: int = 50,
     ):
@@ -52,13 +55,11 @@ def __init__(
         Args:
             session: The Kernel browser session to control
             action_pause: Pause between actions in seconds
-            scroll_amount: Amount to scroll (pixels)
             wait_duration: Duration for wait actions in seconds
             type_delay: Delay between keystrokes in milliseconds
         """
         self.session = session
         self.action_pause = action_pause
-        self.scroll_amount = scroll_amount
         self.wait_duration = wait_duration
         self.type_delay = type_delay
 
@@ -239,21 +240,25 @@ def _execute_hotkey(self, keys: list[str]):
             keys=keys,
         )
 
-    def _execute_scroll(self, x: int, y: int, direction: str):
+    def _execute_scroll(self, x: int, y: int, direction: str, notches: int = 1):
         """Execute a scroll action."""
-        # Move to position first
-        self.session.kernel.browsers.computer.move_mouse(
-            id=self.session.session_id,
-            x=x,
-            y=y,
-        )
-        # Scroll in the specified direction
-        delta_y = self.scroll_amount if direction == "up" else -self.scroll_amount
+        notches = max(notches, 1)
+        delta_x = 0
+        delta_y = 0
+        if direction == "up":
+            delta_y = -notches
+        elif direction == "down":
+            delta_y = notches
+        elif direction == "left":
+            delta_x = -notches
+        elif direction == "right":
+            delta_x = notches
+
         self.session.kernel.browsers.computer.scroll(
             id=self.session.session_id,
             x=x,
             y=y,
-            delta_x=0,
+            delta_x=delta_x,
             delta_y=delta_y,
         )
 
@@ -298,7 +303,7 @@ def _execute_single_action(self, action: Action) -> None:
 
             case ActionType.SCROLL:
                 x, y, direction = self._parse_scroll(arg)
-                self._execute_scroll(x, y, direction)
+                self._execute_scroll(x, y, direction, notches=1)
 
             case ActionType.FINISH:
                 # Task completion - nothing to do
@@ -316,32 +321,23 @@ def _execute_single_action(self, action: Action) -> None:
                 print(f"Unknown action type: {action.type}")
 
     def _execute_action(self, action: Action) -> None:
-        """Execute an action, potentially multiple times."""
+        """Execute an action, potentially multiple times. SCROLL: each event = 1 notch."""
         count = action.count or 1
-
         for _ in range(count):
             self._execute_single_action(action)
-            # Small pause between repeated actions
             if count > 1:
                 time.sleep(self.action_pause)
 
     async def __call__(self, actions: list[Action]) -> None:
-        """
-        Execute a list of actions.
-
-        Args:
-            actions: List of Action objects to execute
-        """
+        """Execute a list of actions."""
         if not self.session.session_id:
             raise RuntimeError("Browser session not initialized")
 
         for action in actions:
             try:
-                # Run the synchronous action execution in a thread pool
                 await asyncio.get_event_loop().run_in_executor(
                     None, self._execute_action, action
                 )
-                # Pause between actions
                 await asyncio.sleep(self.action_pause)
             except Exception as e:
                 print(f"Error executing action {action.type}: {e}")
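
The OpenAGI handler now scrolls one notch per scroll event and relies on the action's repeat count to accumulate distance. The sketch below illustrates that accumulation under stated assumptions: `RecordingScroller`, `execute_scroll`, and `run_action` are hypothetical stand-ins that record deltas instead of calling the Kernel API, and the inter-action pause is omitted:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class RecordingScroller:
    """Hypothetical stand-in for the Kernel scroll call; stores deltas instead of sending them."""
    calls: List[Tuple[int, int]] = field(default_factory=list)

    def scroll(self, x: int, y: int, delta_x: int, delta_y: int) -> None:
        self.calls.append((delta_x, delta_y))


def execute_scroll(scroller: RecordingScroller, x: int, y: int,
                   direction: str, notches: int = 1) -> None:
    """Same direction-to-delta mapping as _execute_scroll in the diff."""
    notches = max(notches, 1)
    delta_x = delta_y = 0
    if direction == "up":
        delta_y = -notches
    elif direction == "down":
        delta_y = notches
    elif direction == "left":
        delta_x = -notches
    elif direction == "right":
        delta_x = notches
    scroller.scroll(x, y, delta_x, delta_y)


def run_action(scroller: RecordingScroller, x: int, y: int,
               direction: str, count: Optional[int] = None) -> None:
    """Repeat a single-notch scroll `count` times, defaulting to once."""
    for _ in range(count or 1):
        execute_scroll(scroller, x, y, direction, notches=1)


if __name__ == "__main__":
    recorder = RecordingScroller()
    run_action(recorder, 100, 200, "down", count=3)
    print(recorder.calls)
```

Three repeated "down" events produce three separate one-notch scroll calls, which is the "N events in a row = N notches" behavior described in the handler's docstring.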
File renamed without changes.
File renamed without changes.
File renamed without changes.
