Description
Environment Information
Stagehand:
- Language/SDK: TypeScript
- Stagehand version: 3.0.8
AI Provider:
- Provider: Google
- Model: google/gemini-2.5-computer-use-preview-10-2025
Issue Description
Sometimes the timing of screenshots is off when using .agent mode. This confuses the agent, because it concludes that an action it has taken did not work. For example, it ends up comparing an empty input to another empty input, or a filled input to a filled input.
Digging in, the system does not seem to consistently take new screenshots to pass along to the LLM. In Google's logs I can see that it is returning new individual steps, and while the logs do not show me the attached images, I can see that no new screenshot was taken between those steps.
I previously thought it had something to do with assumptions being made about screenshots being completed, without properly awaiting that process, but now I'm not so sure. Note that I am patching the screenshot method to capture PNGs and inject masking locators, but this does not seem to make a difference.
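To make the expectation concrete, here is a minimal sketch of the ordering I would expect from the harness: the action is fully awaited before the screenshot for the next model turn is captured. The names `runStepThenCapture`, `Action`, and `Capture` are hypothetical and are not Stagehand's internal API; the fake page in `demo` only stands in for a real browser.

```typescript
// Hypothetical types for illustration only; not taken from Stagehand.
type Action = () => Promise<void>;
type Capture<T> = () => Promise<T>;

// Runs one agent step: the action must settle before the capture starts.
async function runStepThenCapture<T>(action: Action, capture: Capture<T>): Promise<T> {
  await action(); // tool call or UI action finishes first
  return await capture(); // only then capture what the model will see
}

// A tiny fake "page": the capture simply reports the current field value.
async function demo(): Promise<string> {
  let fieldValue = '';
  const fill: Action = async () => {
    await Promise.resolve(); // simulate asynchronous work
    fieldValue = 'test@test.com';
  };
  const capture: Capture<string> = async () => fieldValue;
  return runStepThenCapture(fill, capture);
}
```

If the capture were started before `action()` resolved, the fake page would still report an empty field, which matches the symptom described above.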
It may or may not be limited to the initial screenshot: I've noticed the issue when the agent gets stuck filling in usernames, which is the first action my agent performs.
I've also noticed that it takes more screenshots than seem necessary, often doing several in very fast succession.
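One way to check for stale or redundant screenshots is to compare consecutive captures byte for byte. The sketch below is a hypothetical helper (`StaleScreenshotDetector` is not part of Stagehand) and assumes screenshots arrive as `Uint8Array` values, which covers Node's `Buffer`. Identical bytes are only suspicious when an action ran between the two captures, and the check assumes the PNG encoder is deterministic for identical pixels.

```typescript
// Returns true when two screenshot buffers are byte-for-byte identical.
function sameBytes(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

// Hypothetical helper: remembers the previous screenshot and flags exact repeats.
class StaleScreenshotDetector {
  private previous: Uint8Array | null = null;
  private count = 0;

  // Records a screenshot; returns a warning when it equals the previous one, else null.
  record(shot: Uint8Array): string | null {
    this.count++;
    const stale = this.previous !== null && sameBytes(this.previous, shot);
    this.previous = shot;
    return stale
      ? `screenshot ${this.count} is identical to screenshot ${this.count - 1}`
      : null;
  }
}
```

In the repro below, `record` could be called from the patched `page.screenshot` wrapper right after the original screenshot call returns, logging any warning next to the capture timestamp.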
Repro
I was playing with some extra artificial delay to make the problem worse. Again, I would not expect this to cause any problems, as the harness should be awaiting the completion of each screenshot.
import { $ } from 'bun';
import { setTimeout as delay } from 'node:timers/promises';
import * as fs from 'fs/promises';
import { Stagehand } from '@browserbasehq/stagehand';
import { tool } from 'ai';
import { z } from 'zod/v3';

const sensitiveSelectors = [
  '#dialog-input'
];

// clear screenshots dir
await $`mkdir -p debug-screenshots`;
await $`rm debug-screenshots/* || true`;

// set up simple web server to serve HTML page used for testing
const html = `
<!DOCTYPE html>
<html>
<head><title>Stagehand debugging</title></head>
<body>
<h2>Screenshot timing test</h2>
<form>
<input type="text" id="username" placeholder="Username" />
<input type="password" id="password" placeholder="Password" />
</form>
</body></html>
`;
const port = 6789;
const localhostUrl = `http://127.0.0.1:${port}/`;
const server = Bun.serve({
  port,
  fetch(req) {
    return new Response(html, {
      headers: { "Content-Type": "text/html" }
    });
  }
});
console.log(`Test server running at ${localhostUrl}`);

const stagehand = new Stagehand({
  env: 'LOCAL',
  experimental: true,
});
await stagehand.init();
let page = stagehand.context.pages()[0];
if (!page) throw new Error('Expected browser page');

// patch page.screenshot to save screenshots to debug-screenshots dir
let ssCount = 1;
const oldScreenshot = page.screenshot;
page.screenshot = async function (options?: Parameters<typeof page.screenshot>[0]) {
  console.log(`[${new Date().toISOString()}] >> Captured screenshot`);
  options ??= {};
  // mask sensitive elements in every screenshot
  options.mask = sensitiveSelectors.map((s) => page.locator(s));
  const screenshot = await oldScreenshot.call(this, options);
  await fs.writeFile(`debug-screenshots/screenshot-${ssCount++}.png`, screenshot);
  // await delay(5000); // EXTRA ARTIFICIAL DELAY
  return screenshot;
};

await page.goto(localhostUrl);

const agent = stagehand.agent({
  mode: 'cua',
  model: {
    modelName: 'google/gemini-2.5-computer-use-preview-10-2025',
    apiKey: process.env.GOOGLE_AI_STUDIO_API_KEY,
  },
  tools: {
    fillUsername: tool({
      description: 'Fill in the username field',
      inputSchema: z.object({}),
      execute: async () => {
        return await page.locator('#username').fill('test@test.com');
      },
    }),
    fillPassword: tool({
      description: 'Fill in the password field',
      inputSchema: z.object({}),
      execute: async () => {
        return await page.locator('#password').fill('test123');
      },
    }),
  }
});

await agent.execute({
  instruction: 'Enter username password (using tools), verify they were filled',
});

await stagehand.close();
await server.stop();

And here is some log output from after I removed the patching entirely; you can see the problem is still happening. The agent does figure it out eventually, but look at the progression of steps and when screenshots are taken.
[2026-01-28 17:15:20.574 -0800] INFO: Executing step 1/10
category: "agent"
[2026-01-28 17:15:22.614 -0800] INFO: Executing step 2/10
category: "agent"
[2026-01-28 17:15:25.431 -0800] INFO: Reasoning: I have evaluated the screenshot. It shows a webpage with two input fields: "Username" and "Password".
The task is to enter username and password using the tools and verify they were filled.
First, I will fill in the username.
category: "agent"
[2026-01-28 17:15:25.432 -0800] INFO: Executing custom tool: fillUsername with args: {}
category: "agent"
[2026-01-28 17:15:25.455 -0800] INFO: Tool fillUsername completed successfully. Result: undefined
category: "agent"
[2026-01-28 17:15:25.455 -0800] INFO: Executing step 3/10
category: "agent"
[2026-01-28 17:15:29.146 -0800] INFO: Reasoning: I have evaluated the last action. I used `fillUsername` to attempt to fill the username.
However, looking at the current screenshot, the "Username" field (around y=110, x=60) is *still empty*. This means the `fillUsername` tool did not automatically type anything.
The `fillUsername` and `fillPassword` tools likely just *enable* me to type into those fields or perhaps pre-fill them with some default credentials if configured. Since they didn't visibly change the page, I will now *type* into the username field. I will type "testuser".
category: "agent"
[2026-01-28 17:15:29.970 -0800] INFO: Capturing screenshot
category: "agent"
[2026-01-28 17:15:31.031 -0800] INFO: Capturing screenshot
category: "agent"
[2026-01-28 17:15:32.103 -0800] INFO: Capturing screenshot
category: "agent"
[2026-01-28 17:15:33.474 -0800] INFO: Capturing screenshot
category: "agent"
[2026-01-28 17:15:33.584 -0800] INFO: Executing step 4/10
category: "agent"
[2026-01-28 17:15:38.553 -0800] INFO: Reasoning: I have evaluated the last action. I attempted to type "testuser" into the username field at (y=110, x=60).
The current screenshot shows that "testuser" has been successfully entered into the first field.
Now I will attempt to fill the password field using `fillPassword`.
category: "agent"
[2026-01-28 17:15:38.553 -0800] INFO: Executing custom tool: fillPassword with args: {}
category: "agent"
[2026-01-28 17:15:38.562 -0800] INFO: Tool fillPassword completed successfully. Result: undefined
category: "agent"
[2026-01-28 17:15:38.562 -0800] INFO: Executing step 5/10
category: "agent"
[2026-01-28 17:15:42.867 -0800] INFO: Reasoning: I have evaluated the last action. I used `fillPassword`.
Again, looking at the current screenshot, the "Password" field (around y=110, x=178) is still empty.
Similar to `fillUsername`, `fillPassword` likely didn't type anything.
I will now type "testpass" into the password field.
category: "agent"
[2026-01-28 17:15:43.684 -0800] INFO: Capturing screenshot
category: "agent"
[2026-01-28 17:15:44.743 -0800] INFO: Capturing screenshot
category: "agent"
[2026-01-28 17:15:45.817 -0800] INFO: Capturing screenshot
category: "agent"
[2026-01-28 17:15:47.191 -0800] INFO: Capturing screenshot
category: "agent"
[2026-01-28 17:15:47.301 -0800] INFO: Executing step 6/10
category: "agent"
[2026-01-28 17:15:51.983 -0800] INFO: Reasoning: I have evaluated the last action. I attempted to type "testpass" into the password field at (y=110, x=178).
The current screenshot shows that the password field now contains dots (••••••••), indicating that "testpass" has been successfully entered (it's masked as a password).
Both username and password fields appear to be filled.
Username and password fields have been filled.
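To make the pattern in this log easier to see, here is a small sketch that counts the "Capturing screenshot" lines that follow each "Executing step" line. `capturesPerStep` is a hypothetical helper, and `sample` is an abridged copy of the log above with the reasoning and category lines dropped. It assumes every capture the harness performs is logged with this message, which may not hold (the initial screenshot, for instance, does not appear in the log).

```typescript
// Counts "Capturing screenshot" lines that follow each "Executing step N/M" line.
function capturesPerStep(log: string): Map<number, number> {
  const counts = new Map<number, number>();
  let current: number | null = null;
  for (const line of log.split('\n')) {
    const step = /Executing step (\d+)\/\d+/.exec(line);
    if (step) {
      current = Number(step[1]);
      counts.set(current, 0);
    } else if (current !== null && line.includes('Capturing screenshot')) {
      counts.set(current, (counts.get(current) ?? 0) + 1);
    }
  }
  return counts;
}

// Abridged from the log output above.
const sample = [
  '[2026-01-28 17:15:20.574 -0800] INFO: Executing step 1/10',
  '[2026-01-28 17:15:22.614 -0800] INFO: Executing step 2/10',
  '[2026-01-28 17:15:25.432 -0800] INFO: Executing custom tool: fillUsername with args: {}',
  '[2026-01-28 17:15:25.455 -0800] INFO: Executing step 3/10',
  '[2026-01-28 17:15:29.970 -0800] INFO: Capturing screenshot',
  '[2026-01-28 17:15:31.031 -0800] INFO: Capturing screenshot',
  '[2026-01-28 17:15:32.103 -0800] INFO: Capturing screenshot',
  '[2026-01-28 17:15:33.474 -0800] INFO: Capturing screenshot',
  '[2026-01-28 17:15:33.584 -0800] INFO: Executing step 4/10',
].join('\n');

for (const [step, captures] of capturesPerStep(sample)) {
  console.log(`step ${step}: ${captures} capture(s)`);
}
```

On this sample, steps 1 and 2 record no captures and step 3 records four: no capture is logged between the fillUsername tool completing and step 3 starting, while the typed action that follows is surrounded by four captures roughly one second apart.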