A universal web navigation agent powered by Gemini 1.5 Flash and Playwright. This agent can navigate websites, interact with elements, and achieve goals based on voice or text commands.
- Visual Reasoning: Uses Gemini 1.5 Flash to "see" screenshots and decide on actions.
- Voice Control: Supports spoken goals via speech recognition.
- Set-of-Mark (SoM) Tagging: Automatically identifies and numbers interactive elements on the screen for the AI.
- Site-Agnostic: Designed to work on arbitrary websites rather than a fixed set of supported sites.
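As a rough illustration of the Set-of-Mark idea, the sketch below numbers the usable bounding boxes of interactive elements so a model can refer to them by ID. It is not the project's actual implementation: `INTERACTIVE_SELECTOR`, `assign_marks`, and `collect_marks` are hypothetical names, and the selector is only a guess at what counts as "interactive". The one thing taken from Playwright is the box shape, a dict with `x`, `y`, `width`, and `height`, or `None` for elements that are not visible.

```python
from typing import Dict, List, Optional

# Hypothetical selector for "interactive" elements; the project's real rule may differ.
INTERACTIVE_SELECTOR = "a, button, input, select, textarea, [role='button']"

# Same keys that Playwright's Locator.bounding_box() returns.
Box = Dict[str, float]


def assign_marks(boxes: List[Optional[Box]]) -> List[dict]:
    """Number the usable boxes 1..N, skipping missing or zero-area ones."""
    marks = []
    for box in boxes:
        if box is None or box["width"] <= 0 or box["height"] <= 0:
            continue  # hidden or collapsed element: nothing to click
        marks.append({
            "id": len(marks) + 1,
            "center": (box["x"] + box["width"] / 2, box["y"] + box["height"] / 2),
            "box": box,
        })
    return marks


def collect_marks(page) -> List[dict]:
    """Gather boxes from a Playwright sync-API Page and number them."""
    boxes = [el.bounding_box() for el in page.locator(INTERACTIVE_SELECTOR).all()]
    return assign_marks(boxes)


if __name__ == "__main__":
    sample = [
        {"x": 10, "y": 20, "width": 100, "height": 40},
        None,                                           # not visible
        {"x": 0, "y": 0, "width": 0, "height": 12},     # zero width
        {"x": 200, "y": 50, "width": 60, "height": 30},
    ]
    for mark in assign_marks(sample):
        print(mark["id"], mark["center"])
```

The numbered centers could then be drawn onto the screenshot (for example with Pillow) and used as click targets once the model picks an ID.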
- Python 3.8+
- Git
- Playwright
- Clone the repository:

      git clone <your-repo-url>
      cd agent

- Install dependencies:

      pip install -r requirements.txt
      playwright install chromium

- Configure environment variables: create a `.env` file in the `agent` directory:

      GEMINI_API_KEY=your_api_key_here
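For reference, here is one way the key could be read at startup. This is a standard-library-only sketch, not the project's loader (which may well use a package such as python-dotenv instead); `load_env_file` and `get_api_key` are hypothetical helper names.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path: str = ".env") -> str:
    """Prefer a real environment variable, then fall back to the .env file."""
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(path).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to .env or the environment")
    return key
```

Failing early with a clear message tends to be friendlier than letting the first model call fail with an authentication error.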
Run the main script:

    python main.py

Speak your goal when prompted (e.g., "Go to Wikipedia and search for quantum computing").
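The goal prompt might look roughly like the sketch below, which tries the microphone first and falls back to typed input. It assumes the SpeechRecognition package (imported as `speech_recognition`) with its Google Web Speech backend; `listen_for_goal` and `get_goal` are hypothetical names, and `main.py` may handle this differently.

```python
from typing import Callable, Optional


def listen_for_goal() -> Optional[str]:
    """Capture one utterance and transcribe it; return None on any failure."""
    try:
        import speech_recognition as sr  # provided by the SpeechRecognition package
    except ImportError:
        return None
    recognizer = sr.Recognizer()
    try:
        with sr.Microphone() as source:  # needs PyAudio and a working microphone
            audio = recognizer.listen(source)
        return recognizer.recognize_google(audio)
    except (sr.UnknownValueError, sr.RequestError, OSError, AttributeError):
        # Unintelligible speech, network/API error, no input device, or no PyAudio.
        return None


def get_goal(listen: Callable[[], Optional[str]] = listen_for_goal,
             ask: Callable[[str], str] = input) -> str:
    """Try voice first, then fall back to a typed goal."""
    spoken = listen()
    if spoken and spoken.strip():
        return spoken.strip()
    return ask("Type your goal: ").strip()
```

Passing `listen` and `ask` as parameters keeps the function usable without a microphone, for instance `get_goal(listen=lambda: None)` always prompts for text.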
- Google Gemini 1.5 Flash: Vision-language model for reasoning.
- Playwright: Browser automation.
- SpeechRecognition: Voice-to-text functionality.
- Pillow: Image processing.