AI agent that browses the web hands-free based on your conversational voice commands, executing real actions mid-conversation. Built for the Mistral Worldwide Hackathon — Online Edition
| Scenario | Action | What Happens |
|---|---|---|
| Web Search | "Search Wikipedia for Artificial Intelligence" | VoxiBrowse navigates to Wikipedia, types in the search query, hits enter, and reads out the top result summary. |
| Form Filling | "Enter my name as Muhammad and select priority shipping" | Locates the specific form inputs, types the name, and interacts with the dropdown to select the right shipping option. |
| Scrolling & Navigation | "Scroll down and click on the second article" | Evaluates the visible elements, scrolls the page down, identifies the new DOM state, and clicks the requested element. |
| Interruption | "Actually, cancel that and go to YouTube" | Instantly stops its current TTS narration and browser action, and pivots to load the new URL. |
The killer feature is the custom DOM extraction engine. Instead of relying on slow, expensive coordinate-based vision models, VoxiBrowse injects a lightweight JS extractor that maps all visible, interactive elements to numeric IDs (e.g., [42] "Search"). Mistral Large reads this structured text state and accurately interacts with elements by their designated IDs.
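To make the text state concrete, here is a minimal sketch of how extracted elements could be rendered into the `[42] "Search"` form before being sent to the model. The `Element` dataclass and `render_page_state` are hypothetical names used only for illustration; in the project the extraction itself happens in JavaScript inside the page.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """One visible, interactive element (hypothetical shape)."""
    agent_id: int  # numeric ID assigned by the extractor
    label: str     # visible text, placeholder, or similar description


def render_page_state(elements: list[Element]) -> str:
    """Render each element as one '[id] "label"' line."""
    return "\n".join(f'[{e.agent_id}] "{e.label}"' for e in elements)
```

The model can then refer to an element purely by its number, which is cheaper to produce and check than screen coordinates.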
                   Microphone
                        │ (PyAudio PCM 16kHz)
                        ▼
        ElevenLabs STT (Scribe WebSocket)
                        │
                        ▼
    Mistral Large (Reasoning & Tool Calling)
                        │
               ┌────────┴────────┐
               ▼                 ▼
          Playwright      ElevenLabs TTS
        (Browser & DOM)  (Audio Synthesis)
               │                 │
               ▼                 ▼
           Web Page      Speaker (PyAudio)
- Custom DOM Addressing: Injects `dom_extractor.js` into every page to identify visible, interactive elements and assign them sequential `data-agent-id` numbers. The LLM then triggers Playwright interactions using these numeric IDs.
- Agentic Tool Loop: Mistral Large is strictly constrained to one browser action per turn. This prevents batch-action desync, where clicking an element changes the DOM and invalidates the IDs of subsequent actions.
- Real-Time Interruption: The async event loop continuously monitors ElevenLabs STT for partial transcripts. If the user speaks while the agent is narrating, TTS playback stops instantly, the LLM call is cancelled, and the conversational context rolls back to process the new command smoothly.
- Async I/O: The entire system is built on Python's `asyncio`. Audio streaming to STT, TTS playback chunks, and Playwright execution all run concurrently without blocking the event loop.
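The interruption behaviour can be sketched with plain `asyncio` task cancellation. This is only an illustration under assumptions: `run_turn` and `demo` are hypothetical names, and `asyncio.sleep` stands in for the real LLM call and TTS playback.

```python
import asyncio


async def run_turn(history: list[str], command: str, llm_delay: float) -> str:
    """Stand-in for one LLM call plus narration; the sleep simulates latency."""
    checkpoint = len(history)
    history.append(command)
    try:
        await asyncio.sleep(llm_delay)  # placeholder for LLM + TTS work
        history.append(f"done: {command}")
        return "completed"
    except asyncio.CancelledError:
        del history[checkpoint:]        # roll back this turn's context
        raise


async def demo() -> list[str]:
    history: list[str] = []
    # The first command starts a slow turn.
    turn = asyncio.create_task(run_turn(history, "search wikipedia", 10.0))
    await asyncio.sleep(0.01)           # let the turn start
    # A partial transcript arrives: cancel the in-flight turn.
    turn.cancel()
    try:
        await turn
    except asyncio.CancelledError:
        pass
    # Process the new command against the rolled-back history.
    await run_turn(history, "go to youtube", 0.0)
    return history
```

Running `asyncio.run(demo())` leaves only the second command in the history, because the cancelled turn removes its own entries before re-raising.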
| Component | Technology | Role |
|---|---|---|
| LLM | Mistral Large (`mistral-large-latest`) | Reasoning core and JSON tool calling |
| Speech-to-Text | ElevenLabs Scribe | Real-time audio streaming and VAD transcription via WebSocket |
| Text-to-Speech | ElevenLabs (`eleven_flash_v2_5`) | Low-latency voice synthesis for natural conversations |
| Browser Engine | Playwright (Chromium) | Headful browser automation and JS evaluation |
| Audio Processing | PyAudio | Microphone capture and speaker playback |
| Runtime | Python `asyncio` | Non-blocking concurrent event loop |
voxibrowse/
├── run.py               # Main CLI point of entry
├── requirements.txt     # Dependencies
├── .env                 # Configuration & API Keys
└── backend/
    ├── cli_session.py   # Orchestrates Mic -> STT -> LLM -> Tools -> TTS
    ├── llm_client.py    # Mistral wrapper, tracks action history & loop
    ├── browser_agent.py # Manages Playwright, Navigation, and DOM injection
    ├── dom_extractor.js # JS script to map interactive elements to numeric IDs
    ├── tools.py         # Tool schemas strictly formatted for Mistral
    ├── stt_client.py    # ElevenLabs Realtime STT connection
    ├── tts_client.py    # ElevenLabs TTS streaming generator
    ├── audio_utils.py   # PCM binary utilities
    └── config.py        # Environment settings and hardware sample rates
- Clone the repository: `git clone https://github.com/muhammadehsan02/voxibrowse.git`, then `cd voxibrowse`
- Install Python dependencies: `pip install -r requirements.txt`
- Install Playwright Chromium Browser: `python -m playwright install chromium`
- Environment Variables: Create a `.env` file in the root directory with `MISTRAL_API_KEY=your_mistral_api_key` and `ELEVENLABS_API_KEY=your_elevenlabs_api_key` on separate lines.
- Run the Agent: `python run.py`
Say "hey browser" to wake the agent, and "just stop" to put it to sleep.
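One way to picture the wake and sleep phrases is as a small gate in front of the agent. The sketch below is an assumption about how such gating could work, not the project's actual logic; `WakeGate` and its `filter` method are hypothetical names, and real transcripts may need looser matching (for example "Hey, browser").

```python
from typing import Optional


class WakeGate:
    """Tracks whether the agent is awake, based on committed transcripts."""

    WAKE_PHRASE = "hey browser"
    SLEEP_PHRASE = "just stop"

    def __init__(self) -> None:
        self.awake = False

    def filter(self, transcript: str) -> Optional[str]:
        """Return the command to act on, or None if it should be ignored."""
        text = transcript.lower().strip()
        if self.SLEEP_PHRASE in text:
            self.awake = False
            return None
        if self.WAKE_PHRASE in text:
            self.awake = True
            # Anything spoken after the wake phrase counts as a command.
            rest = text.split(self.WAKE_PHRASE, 1)[1].strip(" ,.!?")
            return rest or None
        return text if self.awake else None
```

While asleep, every transcript is dropped; once awake, transcripts pass through (lowercased) until the sleep phrase is heard.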
| Tool Name | Description |
|---|---|
| `answer_to_user` | Verbally speaks text to the user via TTS stream. |
| `click_element` | Auto-scrolls and clicks an interactive DOM element by its ID. |
| `type_text` | Instantly fills text into standard input fields. |
| `type_and_submit` | Types character-by-character to trigger JS debouncers/autocomplete, then hits Enter. |
| `select_option` | Picks a specified value from native `<select>` dropdowns or custom JS menus. |
| `scroll_down` / `scroll_up` | Dynamically scrolls the page viewport by a chosen intensity (1-5). |
| `go_to_url` | Navigates the active Playwright tab to a specified URL. |
| `get_iframe_content` | Reads and extracts the nested DOM of an embedded iframe. |
| `click_iframe_element` | Directly clicks an element isolated inside a specific iframe by ID. |
| `type_iframe_text` | Feeds text inputs to elements nested deep within iframes. |
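For readers unfamiliar with function calling, a tool like `click_element` is typically declared as a JSON schema and the model replies with a tool name plus a JSON string of arguments. The sketch below shows that shape; the `element_id` parameter name and the `dispatch` helper are assumptions for illustration, and the project's real schemas live in `tools.py`.

```python
import json
from typing import Any, Callable

CLICK_ELEMENT_TOOL: dict[str, Any] = {
    "type": "function",
    "function": {
        "name": "click_element",
        "description": "Auto-scrolls and clicks an interactive DOM element by its ID.",
        "parameters": {
            "type": "object",
            "properties": {
                # "element_id" is an assumed parameter name.
                "element_id": {
                    "type": "integer",
                    "description": "Numeric ID from the page state.",
                },
            },
            "required": ["element_id"],
        },
    },
}


def dispatch(name: str, arguments: str, handlers: dict[str, Callable[..., str]]) -> str:
    """Decode a tool call's JSON arguments and run the matching handler."""
    if name not in handlers:
        return f"error: unknown tool {name!r}"
    try:
        kwargs = json.loads(arguments)
    except json.JSONDecodeError as exc:
        return f"error: invalid arguments ({exc.msg})"
    return handlers[name](**kwargs)
```

Returning error strings instead of raising lets the result be fed back to the model so it can retry with a corrected call.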
- Listen: PyAudio continuously captures 16kHz audio chunks and streams them to ElevenLabs Scribe over a WebSocket.
- Think: A committed STT transcript triggers the agent. `browser_agent.py` evaluates `dom_extractor.js` to get a structured text representation of the current page. The transcript and page state are sent to Mistral Large.
- Act: Mistral decides on exactly one browser action. Playwright executes it.
- Speak: Mistral simultaneously generates a short narration of its action, which is streamed through ElevenLabs TTS to PyAudio.
- Re-evaluate: The agent runs `dom_extractor.js` again to capture the new DOM, feeds it to Mistral, and repeats the process until the user's objective is fully complete.
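The Think, Act, and Re-evaluate steps above can be sketched as a loop that re-extracts the page state before every single action. This is a simplified, synchronous illustration: `run_agent_loop` and its callback parameters are hypothetical, the `max_steps` cap is an added safety assumption, and the real system is asynchronous.

```python
from typing import Callable, Optional

# (tool_name, argument) pair, or None when the model considers the goal done.
Action = Optional[tuple[str, str]]


def run_agent_loop(
    goal: str,
    get_page_state: Callable[[], str],
    decide: Callable[[str, str, list[str]], Action],
    execute: Callable[[str, str], None],
    max_steps: int = 10,
) -> list[str]:
    """Extract state, ask for ONE action, execute it, then re-extract."""
    history: list[str] = []
    for _ in range(max_steps):
        state = get_page_state()  # fresh element IDs on every iteration
        action = decide(goal, state, history)
        if action is None:
            break
        tool, arg = action
        execute(tool, arg)
        history.append(f"{tool}({arg})")
    return history
```

Because the state is refreshed after each action, an ID that became stale when the DOM changed is never reused.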
Muhammad Ehsan — built for the Mistral Worldwide Hackathon
Powered by Mistral