Feat/windows command #108

Open
SiswoHandoko wants to merge 2 commits into google-gemini:main from eryzerz:feat/windows-command

Conversation

@SiswoHandoko

No description provided.

@gemini-code-assist

Summary of Changes

Hello @SiswoHandoko, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the agent's capabilities by enabling it to interact with local desktop environments on both macOS and Windows. This allows the agent to perform tasks directly on the user's operating system, moving beyond browser-only interactions. The changes involve integrating new computer control classes, adapting the agent's core logic for desktop-specific instructions and tool usage, and updating the project's documentation and dependencies to support this new functionality.

Highlights

  • New Desktop Environments: Introduced two new experimental environments, desktop for macOS and desktop_win for Windows, allowing the agent to control local desktop GUIs using OS-level mouse/keyboard automation.
  • Agent Adaptation for Desktop Control: The agent now dynamically adjusts its behavior based on the chosen environment, excluding web-specific functions and providing tailored system instructions for macOS and Windows desktop interactions.
  • Documentation and CLI Updates: The README.md has been updated to reflect the new desktop environments, including usage examples and modifications to the command-line argument table. The main.py script now supports these new --env choices.
  • New Dependencies: Added pyautogui, pillow, and pyscreeze to requirements.txt to enable the desktop automation capabilities.

Changelog

  • README.md
    • Added desktop and desktop_win to the list of available environments.
    • Updated the command-line arguments table to include the new desktop environments for the --env flag.
    • Included new sections with usage examples for 'Local Desktop (experimental)' and 'Local Desktop Windows (experimental)'.
  • agent.py
    • Modified the BrowserAgent constructor to accept an environment parameter.
    • Implemented logic to exclude web-related functions (open_web_browser, search, navigate) when operating in desktop environments.
    • Added environment-specific system instructions for macOS and Windows desktop control to guide the agent's reasoning.
    • Passed the generated system_instruction to the genai.GenerativeModel initialization.
  • computers/__init__.py
    • Imported DesktopComputer and DesktopWindowsComputer from their respective modules.
    • Added DesktopComputer and DesktopWindowsComputer to the __all__ list for module export.
  • computers/desktop/desktop.py
    • Added a new DesktopComputer class that provides methods for controlling a macOS desktop using pyautogui.
    • Implemented core desktop interaction functions such as click_at, hover_at, type_text_at, scroll_document, scroll_at, key_combination, and drag_and_drop.
    • Included platform-specific key normalization for macOS commands.
  • computers/desktop/desktop_win.py
    • Added a new DesktopComputer class (aliased as DesktopWindowsComputer) for controlling a Windows desktop using pyautogui.
    • Implemented similar desktop interaction functions as the macOS version, with Windows-specific key normalizations.
    • Included a runtime check to ensure this environment is only used on Windows (win32).
  • main.py
    • Imported the newly added DesktopComputer and DesktopWindowsComputer classes.
    • Updated the --env command-line argument choices to include desktop and desktop_win.
    • Added conditional logic to instantiate the appropriate Computer subclass based on the --env argument.
    • Passed the selected environment argument to the BrowserAgent constructor.
  • requirements.txt
    • Added pyautogui for programmatic control of the mouse and keyboard.
    • Added pillow for image processing, specifically for screenshots.
    • Added pyscreeze for screenshot capabilities.
  • test_main.py
    • Added test_main_desktop to verify the correct instantiation of DesktopComputer when --env desktop is used.
    • Added test_main_desktop_win to verify the correct instantiation of DesktopWindowsComputer when --env desktop_win is used.

Activity

  • The pull request was authored by SiswoHandoko.
  • New desktop automation capabilities for macOS and Windows have been introduced.
  • The agent's core logic has been updated to support these new environments.
  • Documentation and dependencies have been adjusted to reflect the new features.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@google-cla

google-cla bot commented Feb 10, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for desktop automation on both macOS and Windows by adding desktop and desktop_win environments, involving new DesktopComputer classes and agent logic updates. However, this introduces a critical security vulnerability: Remote Code Execution (RCE) via prompt injection, as the agent is given excessive permissions to control the host system's GUI based on untrusted user input without proper safeguards. Beyond this critical security concern, the changes also present maintainability issues due to significant code duplication between platform-specific computer classes, areas for code cleanup in the agent logic, and a minor documentation inconsistency in the README.

Comment on lines +101 to +133
system_instruction = None
if self._environment == "desktop":
    system_instruction = (
        "You are controlling a local desktop GUI (macOS). "
        "Do not type into the current active app unless you have explicitly focused the correct input field. "
        "To open apps, first open Spotlight with Command+Space using key_combination, "
        "then type the app name and press Enter without clicking in other windows. "
        "Prefer keyboard shortcuts to switch apps instead of typing into arbitrary windows."
    )
    search_intent = any(
        token in self._query.lower()
        for token in ("find", "search", "look for", "locate")
    )
    if search_intent:
        system_instruction += (
            " If the user asks to find or search for something, open Spotlight first."
        )
elif self._environment == "desktop_win":
    system_instruction = (
        "You are controlling a local desktop GUI (Windows). "
        "Do not type into the current active app unless you have explicitly focused the correct input field. "
        "To open apps, first open Start/Search with Win or Win+S using key_combination, "
        "then type the app name and press Enter without clicking in other windows. "
        "Prefer keyboard shortcuts to switch apps instead of typing into arbitrary windows."
    )
    search_intent = any(
        token in self._query.lower()
        for token in ("find", "search", "look for", "locate")
    )
    if search_intent:
        system_instruction += (
            " If the user asks to find or search for something, open Start/Search first."
        )

security-critical critical

The introduction of desktop automation capabilities (macOS and Windows) via pyautogui creates a critical security risk. The agent's actions, driven by the query parameter (untrusted user input), allow for Remote Code Execution (RCE) via prompt injection, potentially enabling an attacker to take control of the host system. This is exacerbated by duplicated logic for handling desktop and desktop_win environments, specifically the search_intent check and system_instruction appending, which could be refactored for better maintainability and to simplify the application of security mitigations.

To mitigate the RCE risk, consider:

  1. Implementing mandatory human-in-the-loop (HITL) confirmation for sensitive actions.
  2. Running the agent in a highly restricted, sandboxed environment.
  3. Implementing a strict allow-list of permitted actions.
  4. Sanitizing and validating all LLM-generated output before execution.

Additionally, refactoring the duplicated logic for desktop and desktop_win environments would improve maintainability and reduce redundancy.
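
As a rough illustration of that refactor, the two branches could collapse into one table-driven helper. This is only a sketch: `build_system_instruction`, `_DESKTOP_PROFILES`, and `_SEARCH_TOKENS` are hypothetical names that do not appear in the PR, and the instruction wording is copied from the diff above.

```python
from typing import Optional

# Hypothetical per-environment settings: (OS label, launcher hint, launcher name).
_DESKTOP_PROFILES = {
    "desktop": ("macOS", "Spotlight with Command+Space", "Spotlight"),
    "desktop_win": ("Windows", "Start/Search with Win or Win+S", "Start/Search"),
}
_SEARCH_TOKENS = ("find", "search", "look for", "locate")


def build_system_instruction(environment: str, query: str) -> Optional[str]:
    """Return the desktop system instruction, or None for non-desktop environments."""
    profile = _DESKTOP_PROFILES.get(environment)
    if profile is None:
        return None
    os_name, launcher_hint, launcher_name = profile
    instruction = (
        f"You are controlling a local desktop GUI ({os_name}). "
        "Do not type into the current active app unless you have explicitly focused the correct input field. "
        f"To open apps, first open {launcher_hint} using key_combination, "
        "then type the app name and press Enter without clicking in other windows. "
        "Prefer keyboard shortcuts to switch apps instead of typing into arbitrary windows."
    )
    # The search-intent check now lives in exactly one place.
    if any(token in query.lower() for token in _SEARCH_TOKENS):
        instruction += (
            f" If the user asks to find or search for something, open {launcher_name} first."
        )
    return instruction
```

With a single code path, any later hardening of the prompt only has to be applied once.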

Comment on lines +51 to +161
def type_text_at(
    self,
    x: int,
    y: int,
    text: str,
    press_enter: bool = False,
    clear_before_typing: bool = True,
) -> EnvState:
    if self._spotlight_pending:
        self._spotlight_pending = False
    else:
        pyautogui.click(x, y)
    if clear_before_typing:
        if sys.platform == "darwin":
            pyautogui.hotkey("command", "a")
        else:
            pyautogui.hotkey("ctrl", "a")
        pyautogui.press("backspace")
    pyautogui.write(text)
    if press_enter:
        pyautogui.press("enter")
    return self.current_state()

def scroll_document(
    self, direction: Literal["up", "down", "left", "right"]
) -> EnvState:
    scroll_amount = self._screen_size[1] // 2
    if direction == "up":
        pyautogui.scroll(scroll_amount)
    elif direction == "down":
        pyautogui.scroll(-scroll_amount)
    elif direction == "left":
        pyautogui.hscroll(-scroll_amount)
    elif direction == "right":
        pyautogui.hscroll(scroll_amount)
    else:
        raise ValueError("Unsupported direction: ", direction)
    return self.current_state()

def scroll_at(
    self,
    x: int,
    y: int,
    direction: Literal["up", "down", "left", "right"],
    magnitude: int = 800,
) -> EnvState:
    pyautogui.moveTo(x, y)
    if direction == "up":
        pyautogui.scroll(magnitude)
    elif direction == "down":
        pyautogui.scroll(-magnitude)
    elif direction == "left":
        pyautogui.hscroll(-magnitude)
    elif direction == "right":
        pyautogui.hscroll(magnitude)
    else:
        raise ValueError("Unsupported direction: ", direction)
    return self.current_state()

def wait_5_seconds(self) -> EnvState:
    time.sleep(5)
    return self.current_state()

def go_back(self) -> EnvState:
    if sys.platform == "darwin":
        pyautogui.hotkey("command", "[")
    else:
        pyautogui.hotkey("alt", "left")
    return self.current_state()

def go_forward(self) -> EnvState:
    if sys.platform == "darwin":
        pyautogui.hotkey("command", "]")
    else:
        pyautogui.hotkey("alt", "right")
    return self.current_state()

def search(self) -> EnvState:
    return self.navigate(self._search_engine_url)

def navigate(self, url: str) -> EnvState:
    normalized_url = url
    if not normalized_url.startswith(("http://", "https://")):
        normalized_url = "https://" + normalized_url
    if sys.platform == "darwin":
        pyautogui.hotkey("command", "l")
    else:
        pyautogui.hotkey("ctrl", "l")
    pyautogui.write(normalized_url)
    pyautogui.press("enter")
    self._current_url = normalized_url
    time.sleep(1)
    return self.current_state()

def key_combination(self, keys: list[str]) -> EnvState:
    normalized_keys = [self._normalize_key(key) for key in keys]
    if len(normalized_keys) == 1:
        pyautogui.press(normalized_keys[0])
    else:
        pyautogui.hotkey(*normalized_keys)
    if sys.platform == "darwin" and normalized_keys == ["command", "space"]:
        self._spotlight_pending = True
        time.sleep(0.2)
    return self.current_state()

def drag_and_drop(
    self, x: int, y: int, destination_x: int, destination_y: int
) -> EnvState:
    pyautogui.moveTo(x, y)
    pyautogui.dragTo(destination_x, destination_y, button="left")
    return self.current_state()

security-high high

The DesktopComputer class provides the LLM with excessive permissions by allowing it to perform OS-level input automation via pyautogui. This includes typing arbitrary text (type_text_at), navigating to arbitrary URLs (navigate), and pressing any key combination (key_combination). When combined with an LLM that processes untrusted user input, this tool can be abused to compromise the host system.

Consider restricting the tool's capabilities to the minimum necessary for the intended task and ensuring that all actions are performed in a secure, isolated environment.
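
One possible shape for such a restriction is a thin gate between the model's tool calls and the computer object, combining an allow-list with a human confirmation step. The sketch below is an assumption-laden illustration rather than code from the PR: `guarded_call`, `ActionDenied`, and the two action sets are hypothetical, and which actions count as low-risk is a policy decision for the maintainers.

```python
from typing import Any, Callable

# Assumption: this split between low-risk and sensitive actions is
# illustrative only; it is not taken from the PR.
LOW_RISK_ACTIONS = frozenset(
    {"hover_at", "scroll_document", "scroll_at", "wait_5_seconds"}
)
SENSITIVE_ACTIONS = frozenset(
    {"click_at", "type_text_at", "key_combination", "navigate", "drag_and_drop"}
)


class ActionDenied(Exception):
    """Raised when an action is off the allow-list or the user declines it."""


def guarded_call(
    computer: Any,
    action: str,
    confirm: Callable[[str, dict], bool],
    **kwargs: Any,
) -> Any:
    """Run `action` on `computer` only if policy and, where needed, the user allow it."""
    if action in SENSITIVE_ACTIONS:
        # Human-in-the-loop: the callback decides whether to proceed.
        if not confirm(action, kwargs):
            raise ActionDenied(f"user declined {action}")
    elif action not in LOW_RISK_ACTIONS:
        # Anything not explicitly listed is rejected by default.
        raise ActionDenied(f"{action} is not on the allow-list")
    return getattr(computer, action)(**kwargs)
```

In an interactive session, `confirm` might print the action and its arguments and return `input("Proceed? [y/N] ").lower() == "y"`. A gate like this complements sandboxing but does not replace it.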

Comment on lines +55 to +153
def type_text_at(
    self,
    x: int,
    y: int,
    text: str,
    press_enter: bool = False,
    clear_before_typing: bool = True,
) -> EnvState:
    if self._launcher_pending:
        self._launcher_pending = False
    else:
        pyautogui.click(x, y)
    if clear_before_typing:
        pyautogui.hotkey("ctrl", "a")
        pyautogui.press("backspace")
    pyautogui.write(text)
    if press_enter:
        pyautogui.press("enter")
    return self.current_state()

def scroll_document(
    self, direction: Literal["up", "down", "left", "right"]
) -> EnvState:
    scroll_amount = self._screen_size[1] // 2
    if direction == "up":
        pyautogui.scroll(scroll_amount)
    elif direction == "down":
        pyautogui.scroll(-scroll_amount)
    elif direction == "left":
        pyautogui.hscroll(-scroll_amount)
    elif direction == "right":
        pyautogui.hscroll(scroll_amount)
    else:
        raise ValueError("Unsupported direction: ", direction)
    return self.current_state()

def scroll_at(
    self,
    x: int,
    y: int,
    direction: Literal["up", "down", "left", "right"],
    magnitude: int = 800,
) -> EnvState:
    pyautogui.moveTo(x, y)
    if direction == "up":
        pyautogui.scroll(magnitude)
    elif direction == "down":
        pyautogui.scroll(-magnitude)
    elif direction == "left":
        pyautogui.hscroll(-magnitude)
    elif direction == "right":
        pyautogui.hscroll(magnitude)
    else:
        raise ValueError("Unsupported direction: ", direction)
    return self.current_state()

def wait_5_seconds(self) -> EnvState:
    time.sleep(5)
    return self.current_state()

def go_back(self) -> EnvState:
    pyautogui.hotkey("alt", "left")
    return self.current_state()

def go_forward(self) -> EnvState:
    pyautogui.hotkey("alt", "right")
    return self.current_state()

def search(self) -> EnvState:
    return self.navigate(self._search_engine_url)

def navigate(self, url: str) -> EnvState:
    normalized_url = url
    if not normalized_url.startswith(("http://", "https://")):
        normalized_url = "https://" + normalized_url
    pyautogui.hotkey("ctrl", "l")
    pyautogui.write(normalized_url)
    pyautogui.press("enter")
    self._current_url = normalized_url
    time.sleep(1)
    return self.current_state()

def key_combination(self, keys: list[str]) -> EnvState:
    normalized_keys = [self._normalize_key(key) for key in keys]
    if len(normalized_keys) == 1:
        pyautogui.press(normalized_keys[0])
    else:
        pyautogui.hotkey(*normalized_keys)
    if normalized_keys in (["win"], ["win", "s"]):
        self._launcher_pending = True
        time.sleep(0.2)
    return self.current_state()

def drag_and_drop(
    self, x: int, y: int, destination_x: int, destination_y: int
) -> EnvState:
    pyautogui.moveTo(x, y)
    pyautogui.dragTo(destination_x, destination_y, button="left")
    return self.current_state()

security-high high

The DesktopComputer class for Windows provides the LLM with excessive permissions via pyautogui, similar to the macOS version. This tool can be abused via prompt injection to perform malicious actions on the host Windows system.

Consider implementing strict safeguards and running the agent in a restricted environment.

from ..computer import Computer, EnvState


class DesktopComputer(Computer):

high

This file is almost an exact copy of computers/desktop/desktop.py. This significant code duplication will make future maintenance difficult, as changes will need to be applied in two places.

A better approach would be to create a common base class, say BaseDesktopComputer, that contains all the shared logic. Then, DesktopComputer (for macOS) and DesktopWindowsComputer can inherit from this base class and override only the methods that have platform-specific implementations (like _normalize_key, go_back, navigate, etc.). This will greatly improve code reuse and maintainability.
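
A minimal sketch of that layout follows, under simplifying assumptions: the key sender is injected instead of calling pyautogui directly, the classes do not inherit from `Computer`, and `EnvState` is left out so the example runs on its own. `BaseDesktopComputer`, `MODIFIER_KEY`, `BACK_KEYS`, `FORWARD_KEYS`, and `focus_address_bar` are hypothetical names; the key bindings themselves come from the two files under review.

```python
from typing import Callable


class BaseDesktopComputer:
    """Hypothetical shared base: logic lives here, key bindings are class data."""

    # Defaults match the Windows bindings shown in desktop_win.py.
    MODIFIER_KEY = "ctrl"
    BACK_KEYS: tuple[str, ...] = ("alt", "left")
    FORWARD_KEYS: tuple[str, ...] = ("alt", "right")

    def __init__(self, send_hotkey: Callable[..., None]) -> None:
        # In the real classes this would be pyautogui.hotkey (assumption).
        self._send_hotkey = send_hotkey

    def go_back(self) -> None:
        self._send_hotkey(*self.BACK_KEYS)

    def go_forward(self) -> None:
        self._send_hotkey(*self.FORWARD_KEYS)

    def focus_address_bar(self) -> None:
        # Shared step of navigate(): modifier + "l" on both platforms.
        self._send_hotkey(self.MODIFIER_KEY, "l")


class DesktopComputer(BaseDesktopComputer):
    """macOS: override only the bindings that differ."""

    MODIFIER_KEY = "command"
    BACK_KEYS = ("command", "[")
    FORWARD_KEYS = ("command", "]")


class DesktopWindowsComputer(BaseDesktopComputer):
    """Windows: the base defaults already apply."""
```

Expressing the platform differences as data keeps methods such as `scroll_at` and `drag_and_drop`, which are identical in both files, in a single place.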

| Argument | Description | Required | Default | Supported Environment(s) |
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------- | -------- | ----------------------------------------- | ------------------------ |
| `--query` | The natural language query for the browser agent to execute. | Yes | N/A | All |
| `--env` | The computer use environment to use. Must be one of the following: `playwright`, `browserbase`, `desktop`, or `desktop_win` | No | N/A | All |

medium

The documentation for the --env argument is missing its default value. The table shows "N/A", but the code in main.py specifies default="playwright". To improve clarity for users, the default value should be documented here.

Suggested change
- | `--env` | The computer use environment to use. Must be one of the following: `playwright`, `browserbase`, `desktop`, or `desktop_win` | No | N/A | All |
+ | `--env` | The computer use environment to use. Must be one of the following: `playwright`, `browserbase`, `desktop`, or `desktop_win` | No | `playwright` | All |
References
  1. In documentation, command-line arguments that have a default value should be marked as not required and their default value should be specified.
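
For context, the behavior the README should describe can be sketched with argparse. This is an approximation, not the PR's actual `main.py`: `build_parser` is a hypothetical helper, and only the flag names, the four choices, and `default="playwright"` are taken from this review.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical reconstruction of the two flags discussed above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--query", required=True)
    parser.add_argument(
        "--env",
        choices=("playwright", "browserbase", "desktop", "desktop_win"),
        default="playwright",  # the value the README table should show
    )
    return parser


args = build_parser().parse_args(["--query", "open the calculator"])
print(args.env)  # prints "playwright"
```

Because `--env` falls back to `playwright` when omitted, "N/A" in the Default column misstates what users get.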

from ..computer import Computer, EnvState


class DesktopComputer(Computer):

medium

The class is named DesktopComputer, which is the same as the class in computers/desktop/desktop.py. This is confusing and can lead to errors. Since it's imported as DesktopWindowsComputer in computers/__init__.py, it would be clearer and more consistent to rename the class itself to DesktopWindowsComputer.

Suggested change
- class DesktopComputer(Computer):
+ class DesktopWindowsComputer(Computer):
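
One concrete way the shared name can mislead: an import alias rebinds the name at the import site but leaves the class's own `__name__` untouched, so tracebacks and log output would still say `DesktopComputer`. The snippet below is a standalone sketch with a stand-in class; the exact import statement in `computers/__init__.py` is assumed, not quoted.

```python
class DesktopComputer:
    """Stand-in for the class defined in computers/desktop/desktop_win.py."""


# Same effect as an `import ... as DesktopWindowsComputer` alias (assumed form).
DesktopWindowsComputer = DesktopComputer

alias_name = DesktopWindowsComputer.__name__
print(alias_name)  # prints "DesktopComputer"
```

Renaming the class itself makes the name in diagnostics match the name callers use.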
